Regex Guru

Friday, 16 May 2014

New Regular Expression Features in Java 8

Filed under: Regex Libraries — Jan Goyvaerts @ 13:47

Java 8 brings a few changes to Java’s regular expression syntax to make it more consistent with Perl 5.14 and later in matching horizontal and vertical whitespace.

\h is a new feature. It is a shorthand character class that matches any horizontal whitespace character as defined in the Unicode standard.
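A minimal sketch of \h in use, assuming Java 8 or later. The class and method names are made up for illustration. It replaces each horizontal whitespace character with an underscore so the matches are visible, and leaves the line feed alone because a line feed is vertical whitespace.

```java
import java.util.regex.Pattern;

class HorizontalWhitespaceDemo {
    // \h is only recognised by java.util.regex in Java 8 and later.
    private static final Pattern HORIZONTAL = Pattern.compile("\\h");

    // Replaces every horizontal whitespace character with an underscore.
    static String markHorizontal(String input) {
        return HORIZONTAL.matcher(input).replaceAll("_");
    }

    public static void main(String[] args) {
        // Tab, no-break space (U+00A0) and plain space are horizontal
        // whitespace; the line feed before "e" is not.
        String sample = "a\tb\u00A0c d\ne";
        System.out.println(markHorizontal(sample));
    }
}
```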

In Java 4 to 7, \v is a character escape that matches only the vertical tab character. In Java 8, \v is a shorthand character class that matches any vertical whitespace, including the vertical tab. When upgrading to Java 8, check that any regexes that use \v still do what you intend. Use \x0B or \cK to match just the vertical tab in any version of Java.
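The difference can be checked with a short sketch. The class and helper names are illustrative only. The \x0B and \cK patterns should behave the same on any Java version, while the \v result depends on the runtime: true for a line feed on Java 8 and later, false on Java 7 and earlier.

```java
import java.util.regex.Pattern;

class VerticalTabDemo {
    // The vertical tab character, U+000B.
    static final String VT = String.valueOf((char) 0x0B);

    // Version-independent ways to match only the vertical tab.
    static final Pattern VT_HEX = Pattern.compile("\\x0B");
    static final Pattern VT_CTRL = Pattern.compile("\\cK");

    // Shorthand whose meaning changed in Java 8.
    static final Pattern V_SHORTHAND = Pattern.compile("\\v");

    // True if the pattern matches the entire input.
    static boolean matchesWhole(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("\\x0B matches VT: " + matchesWhole(VT_HEX, VT));
        System.out.println("\\cK  matches VT: " + matchesWhole(VT_CTRL, VT));
        System.out.println("\\x0B matches LF: " + matchesWhole(VT_HEX, "\n"));
        // Depends on the Java version running this code.
        System.out.println("\\v   matches LF: " + matchesWhole(V_SHORTHAND, "\n"));
    }
}
```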

\R is also a new feature. It matches any line break as defined by the Unicode standard. Windows-style CRLF pairs are always matched as a whole. So \R matches \r\n while \R\R fails to match \r\n. \R is equivalent to (?>\r\n|[\n\cK\f\r\u0085\u2028\u2029]) with an atomic group that prevents it from matching only the CR in a CRLF pair. Oracle’s documentation for the Pattern class omits the atomic group when explaining \R, which is incorrect. You cannot use \R inside a character class.
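A small sketch of why the atomic group matters, assuming Java 8 or later. The class, constant, and method names are hypothetical. It spells out the equivalent pattern from the paragraph above with and without the atomic group and applies each one twice to a single CRLF pair. The result of \R\R itself is printed without an expected value, since the behavior described here is for Java 8 and other releases may differ.

```java
import java.util.regex.Pattern;

class LineBreakDemo {
    // The atomic-group equivalent of \R quoted in the text.
    static final String ATOMIC =
        "(?>\\r\\n|[\\n\\cK\\f\\r\\u0085\\u2028\\u2029])";

    // The same alternation in a plain non-capturing group, for comparison.
    static final String NON_ATOMIC =
        "(?:\\r\\n|[\\n\\cK\\f\\r\\u0085\\u2028\\u2029])";

    // Splits text on any line break, treating CRLF as one break.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        String crlf = "\r\n";

        // A single \R consumes the whole CRLF pair.
        System.out.println(crlf.matches("\\R"));

        // Atomic: the first group keeps \r\n, so nothing is left for the second.
        System.out.println(crlf.matches(ATOMIC + ATOMIC));

        // Non-atomic: the first group can backtrack to \r, leaving \n.
        System.out.println(crlf.matches(NON_ATOMIC + NON_ATOMIC));

        // Reported to fail on Java 8; check on your own runtime.
        System.out.println(crlf.matches("\\R\\R"));

        System.out.println(splitLines("one\r\ntwo\nthree").length);
    }
}
```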

RegexBuddy and RegexMagic have been updated to support Java 8. Java 4, 5, 6, and 7 are still supported. When you upgrade to Java 8 you can compare or convert your regular expressions between Java 8 and the Java version you were using previously.

2 Comments

  1. Have you supported character class union and intersection in Java? In version 3.5.5 it doesn’t seem to be supported. By the way, the reference implementation has some bugs regarding nested character class and negation – not sure how you are going to implement it.

    Comment by nhahtdh — Thursday, 12 June 2014 @ 23:26

  2. RegexBuddy 4 supports all the regex features in Java 4 to 8, including character class union and intersection. RegexBuddy 4 emulates the behavior of Oracle’s Java SE VM, including the way it deals with character class union and intersection inside negated classes. In case of outright bugs that can’t be emulated, RegexBuddy treats that part of the regex syntax as an error saying that Java does not correctly support this feature.

    Comment by Jan Goyvaerts — Monday, 16 June 2014 @ 10:13
