Regex Guru

Thursday, 29 May 2014

What’s New in Delphi XE6 Regular Expressions

Filed under: Regex Libraries — Jan Goyvaerts @ 13:27

There’s not much new in the regular expression support in Delphi XE6. The big change that should be made, upgrading to PCRE 8.30 or later and switching to the pcre16 functions that use UTF-16, still hasn’t been made. XE6 still uses PCRE 7.9 and thus continues to require conversion from the UTF-16 strings that Delphi uses natively to the UTF-8 strings that older versions of PCRE require.

Delphi XE6 does fix one important issue that has plagued TRegEx since it was introduced in Delphi XE. Previously, TRegEx could not find zero-length matches. So a regex like (?m)^ that should find a zero-length match at the start of each line would not find any matches at all with TRegEx.

The reason for this is that TRegEx uses TPerlRegEx to do the heavy lifting. TPerlRegEx sets its State property to [preNotEmpty] in its constructor, which tells it to skip zero-length matches. This is not a problem with TPerlRegEx because users of this class can change the State property. But TRegEx does not provide a way to change this property. So in Delphi XE5 and prior, TRegEx cannot find zero-length matches.

In Delphi XE6, TPerlRegEx’s constructor was changed to initialize State to the empty set. This means TRegEx is now able to find zero-length matches. TRegEx.Replace() using the regex (?m)^ now inserts the replacement at the start of each line, as you would expect. If you use TPerlRegEx directly and relied on it skipping zero-length matches, you’ll now need to set State to [preNotEmpty] in your own code.
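The effect of a zero-length match on a replacement is the same in any regex engine that reports such matches. Here is a minimal sketch using Python’s re module as a stand-in; it is not Delphi’s TRegEx, and the sample text and the "> " replacement are made up for illustration.

```python
import re

# Hypothetical sample text with three lines.
text = "first line\nsecond line\nthird line"

# (?m)^ matches the empty string at the start of every line, so the
# replacement is inserted there and no existing text is removed.
quoted = re.sub(r"(?m)^", "> ", text)

print(quoted)
```

An engine that skips zero-length matches finds nothing to replace here and returns the text unchanged, which is what TRegEx did in XE5 and prior.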

You will need to check existing applications that use TRegEx for regular expressions that unintentionally allow zero-length matches. In XE5 and prior, TRegEx using \d* would match all numbers in a string and nothing else. In XE6, the same regex still matches all numbers, but it also finds zero-length matches at positions where there are no digits to match. Using \d+ instead requires at least one digit and so avoids the zero-length matches.
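The difference between \d* and \d+ can be sketched in a few lines. This again uses Python’s re module as a stand-in, not Delphi’s TRegEx. The subject string is made up, and the exact positions at which empty matches are reported can differ between engines. Filtering out the empty matches only approximates what a "not empty" option does, though for a regex as simple as \d* the results are the same.

```python
import re

# Hypothetical subject string containing two numbers.
subject = "ab12c345"

# \d* may match zero digits, so it also succeeds where no number starts.
all_matches = [(m.start(), m.group()) for m in re.finditer(r"\d*", subject)]

# Dropping the empty matches approximates an engine that skips them.
non_empty = [matched for _, matched in all_matches if matched]

# \d+ requires at least one digit, so it can never match the empty string.
strict = re.findall(r"\d+", subject)

print(all_matches)
print(non_empty)
print(strict)
```

If the filtered list and the \d+ list agree for your own data, switching the regex to \d+ is likely the simplest fix for code that moves to XE6.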

RegexBuddy 4 warns about zero-length matches on the Create panel if you set it to Detailed mode. At the bottom of the regex tree there will be a node saying either “your regular expression may find zero-length matches” or “zero-length matches will be skipped” depending on whether your application allows zero-length matches (XE6 TRegEx) or not (XE–XE5 TRegEx).

2 Comments »

  1. I am looking for the fastest regular expression processor in the universe.

    These guys I work with run about 100 regular expressions per file in our workflow using stuff they wrote in Microsoft C# “Dismal” Studio.

    It is like their processor does one replacement per file I/O… like one at a time… like it reads in the file, does one transformation, writes the output, and then does it all over again for the next expression.

    Moooooooooooolasses. Drip.

    I want 1-I/O.

    That is one input, massive expression processing, and one output. Done.
    Wham-bam, thank you ma’am. It’s Miller time!

    Thoughts?

    Comment by John Hoffman — Friday, 30 May 2014 @ 7:03

  2. I don’t know if it’s the fastest in the universe (there are many planets I haven’t been to yet), but our product PowerGREP can do a search-and-replace using 100 or more regular expressions while reading and writing each file only once. Set “action type” to “search-and-replace”, set “search type” to “delimited regular expressions” or “list of regular expressions”, and make sure “non-overlapping search” is turned on. Then add all your regexes on the Action panel. The Quick Replace button will do the search-and-replace the fastest, but you may want to start with Preview or Replace so you can inspect the results.

    Comment by Jan Goyvaerts — Friday, 30 May 2014 @ 15:08
