Regex Guru

Thursday, 29 May 2014

What’s New in Delphi XE6 Regular Expressions

Filed under: Regex Libraries — Jan Goyvaerts @ 13:27

There’s not much new in the regular expression support in Delphi XE6. The big change that should be made, upgrading to PCRE 8.30 or later and switching to the pcre16 functions that use UTF-16, still hasn’t been made. XE6 still uses PCRE 7.9 and thus continues to require conversion from the UTF-16 strings that Delphi uses natively to the UTF-8 strings that older versions of PCRE require.

Delphi XE6 does fix one important issue that has plagued TRegEx since it was introduced in Delphi XE. Previously, TRegEx could not find zero-length matches. So a regex like (?m)^ that should find a zero-length match at the start of each line would not find any matches at all with TRegEx.

The reason for this is that TRegEx uses TPerlRegEx to do the heavy lifting. TPerlRegEx sets its State property to [preNotEmpty] in its constructor, which tells it to skip zero-length matches. This is not a problem with TPerlRegEx because users of this class can change the State property. But TRegEx does not provide a way to change this property. So in Delphi XE5 and prior, TRegEx cannot find zero-length matches.

In Delphi XE6 TPerlRegEx’s constructor was changed to initialize State to the empty set. This means TRegEx is now able to find zero-length matches. TRegex.Replace() using the regex (?m)^ now inserts the replacement at the start of each line, as you would expect. If you use TPerlRegEx directly, you’ll need to set State to [preNotEmpty] in your own code if you relied on its behavior to skip zero-length matches.

You will need to check existing applications that use TRegEx for regular expressions that incorrectly allow zero-length matches. In XE5 and prior, TRegEx using \d* would match all numbers in a string. In XE6, the same regex still matches all numbers, but also finds a zero-length match at each position in the string.

RegexBuddy 4 warns about zero-length matches on the Create panel if you set it to Detailed mode. At the bottom of the regex tree there will be a node saying either “your regular expression may find zero-length matches” or “zero-length matches will be skipped” depending on whether your application allows zero-length matches (XE6 TRegEx) or not (XE–XE5 TRegEx).

Monday, 19 May 2014

Python 3.4 adds re.fullmatch()

Filed under: Regex Libraries — Jan Goyvaerts @ 13:56

Python 3.4 does not bring any changes to its regular expression syntax compared to previous 3.x releases. It does add one new function to the re module called fullmatch(). This function takes a regular expression and a subject string as its parameters. It returns a Match object if the regular expression can match the string entirely. It returns None if the string cannot be matched or if it can only be matched partially. This is useful when using a regular expression to validate user input. In Boolean context like an if statement, the Match object evaluates to True, while None evaluates to False.

Do note that fullmatch() will evaluate to True if the subject string is the empty string and the regular expression can find zero-length matches. A zero-length match of a zero-length string is a complete match. So if you want to check whether the user entered a sequence of digits, use \d+ rather than \d* as the regex.

Friday, 16 May 2014

New Regular Expression Features in Java 8

Filed under: Regex Libraries — Jan Goyvaerts @ 13:47

Java 8 brings a few changes to Java’s regular expression syntax to make it more consistent with Perl 5.14 and later in matching horizontal and vertical whitespace.

\h is a new feature. It is a shorthand character class that matches any horizontal whitespace character as defined in the Unicode standard.

In Java 4 to 7 \v is a character escape that matches only the vertical tab character. In Java 8 \v is a shorthand character class that matches any vertical whitespace, including the vertical tab. When upgrading to Java 8, make sure that any regexes that use \v still do what you want. Use \x0B or \cK to match just the vertical tab in any version of Java.

\R is also a new feature. It matches any line break as defined by the Unicode standard. Windows-style CRLF pairs are always matched as a whole. So \R matches \r\n while \R\R fails to match \r\n. \R is equivalent to (?>\r\n|[\n\cK\f\r\u0085\u2028\u2029]) with an atomic group that prevents it from matching only the CR in a CRLF pair. Oracle’s documentation for the Pattern class omits the atomic group when explaining \R, which is incorrect. You cannot use \R inside a character class.

RegexBuddy and RegexMagic have been updated to support Java 8. Java 4, 5, 6, and 7 are still supported. When you upgrade to Java 8 you can compare or convert your regular expressions between Java 8 and the Java version you were using previously.

Wednesday, 14 May 2014

New regular expression features in PCRE 8.34 and 8.35

Filed under: Regex Libraries — Jan Goyvaerts @ 13:26

PCRE 8.34 adds some new regex features and changes the behavior of a few to make it better compatible with the latest versions of Perl. There are no changes to the regex syntax in PCRE 8.35.

\o{377} is now an octal escape just like \377. This syntax was first introduced in Perl 5.12. It avoids any confusion between octal escapes and backreferences. It also allows octal numbers beyond 377 to be used. E.g. \o{400} is the same as \x{100}. If you have any reason to use octal escapes instead of hexadecimal escapes then you should definitely use the new syntax. Because of this change, \o is now an error when it doesn’t form a valid octal escape. Previously \o was a literal o and \o{377} was a sequence of 337 o‘s.

In free-spacing mode, whitespace between a quantifier and the ? that makes it lazy or the + that makes it possessive is now ignored. In Perl this has always been the case. In PCRE 8.33 and prior, whitespace ended a quantifier and any following ? or + was seen as a second quantifier and thus an error.

The shorthand \s now matches the vertical tab character in addition to the other whitespace characters it previously matched. Perl 5.18 made the same change. Many other regex flavors have always included the vertical tab in \s, just like POSIX has always included it in [[:space:]].

Names of capturing groups are no longer allowed to start with a digit. This has always been the case in Perl since named groups were added to Perl 5.10. PCRE 8.33 and prior even allowed group names to consist entirely of digits.

[[:<:]] and [[:>:]] are now treated as POSIX-style word boundaries. They match at the start and the end of a word. Though they use similar syntax, these have nothing to do with POSIX character classes and cannot be used inside character classes. Perl does not support POSIX word boundaries.

The same changes affect PHP 5.5.10 (and later) and R 3.0.3 (and later) as they have been updated to use PCRE 8.34.

RegexBuddy and RegexMagic have been updated to support the latest versions of PCRE, PHP, and R. Older versions that were previously supported are still supported, so you can compare or convert your regular expressions between the latest versions of PCRE, PHP, and R and whichever version you were using previously.

Next Page »