Regex Guru

Thursday, 18 December 2008

TPerlRegEx Now with Proper UTF-8 (Unicode) Support

Filed under: Regex Libraries — Jan Goyvaerts @ 16:36

Five months ago I released TPerlRegEx for Delphi 2009, which enables PCRE’s UTF-8 support when compiled with Delphi 2009, so you can use TPerlRegEx with Unicode strings. TPerlRegEx still supports Delphi 2007 and earlier using Ansi strings.

Unfortunately, until today’s new release, TPerlRegEx for Delphi 2009 had a rather embarrassing bug: it didn’t actually enable the UTF-8 support in PCRE unless you set the Options property to something other than the default. With the default options, SetOptions never gets called, and that was the only spot where TPerlRegEx set the PCRE_UTF8 flag.

The new release of TPerlRegEx fixes this by adding these five lines to the constructor:

{$IFDEF UNICODE}
  pcreOptions := PCRE_UTF8 or PCRE_NEWLINE_ANY;
{$ELSE}
  pcreOptions := PCRE_NEWLINE_ANY;
{$ENDIF}

This makes sure TPerlRegEx tells PCRE to use UTF-8 whenever TPerlRegEx uses UTF8String, even if the Options property is never changed.

When using the buggy version of TPerlRegEx with Delphi 2009, this subject:

PerlRegEx1.Subject := '€';

will match ^.{3}$ but not ^.$.

When using the corrected version of TPerlRegEx, ^.{3}$ fails while ^.$ matches.

The reason is that in UTF-8, the euro symbol is encoded as three bytes. When PCRE operates in UTF-8 mode, the dot matches one Unicode code point, regardless of how many bytes that code point is encoded with in UTF-8. (All code points take between 1 and 4 bytes.) But when the PCRE_UTF8 flag is not set, PCRE operates in 8-bit mode, and the dot always matches one byte.
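The difference between byte mode and code point mode is easy to reproduce outside Delphi. The sketch below uses Python's re module as a stand-in for PCRE. That is an assumption made purely for illustration, not how TPerlRegEx works internally. A bytes pattern plays the role of PCRE's 8-bit mode, and a str pattern plays the role of UTF-8 mode. The variable names are made up for this example.

```python
import re

euro = "€"
utf8_bytes = euro.encode("utf-8")  # the euro symbol is three bytes in UTF-8

# Byte-oriented matching (comparable to PCRE without PCRE_UTF8):
# the dot consumes one byte at a time.
byte_mode_three = re.search(rb"^.{3}$", utf8_bytes) is not None
byte_mode_one = re.search(rb"^.$", utf8_bytes) is not None

# Code-point-oriented matching (comparable to PCRE with PCRE_UTF8):
# the dot consumes one code point, however many bytes encode it.
char_mode_three = re.search(r"^.{3}$", euro) is not None
char_mode_one = re.search(r"^.$", euro) is not None

print(len(utf8_bytes), byte_mode_three, byte_mode_one)
print(len(euro), char_mode_three, char_mode_one)
```

In byte mode only ^.{3}$ matches, which mirrors the buggy behavior. In code point mode only ^.$ matches, which mirrors the corrected behavior.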

In Delphi 2007 and prior, ^.$ also matches the euro symbol, because it takes up one byte in the Windows (Ansi) code pages.
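A small sketch of the single-byte case, using Python's re module as a stand-in for PCRE in 8-bit mode. This assumes the Windows-1252 code page. Other Ansi code pages may encode the euro symbol differently or not at all. The variable names are hypothetical.

```python
import re

# Encode the euro symbol with Windows-1252 (assumed code page).
ansi_bytes = "€".encode("cp1252")
ansi_length = len(ansi_bytes)

# Byte-oriented dot: one byte is enough to cover the whole subject.
ansi_matches_one = re.search(rb"^.$", ansi_bytes) is not None

print(ansi_bytes, ansi_length, ansi_matches_one)
```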

If you download TPerlRegEx again, it’ll properly handle Unicode in Delphi 2009.

3 Comments

  1. When I install it, an error occurs (my Delphi is Delphi 2009 Update 1).

    [DCC Warning] PerlRegEx.pas(265): W1063 Widening given AnsiChar constant (#$B7) to WideChar lost information
    [DCC Error] PerlRegEx.pas(265): E2030 Duplicate case label
    [DCC Fatal Error] PerlRegExD2009.dpk(35): F2063 Could not compile used unit 'PerlRegEx.pas'

    PerlRegEx.pas(Line 265):
    #0..'&', '(', '*', '+', ',', '-', '.', '<', '[', '{', '?:

    I made the following change:
    #0..'&', '(', '*', '+', ',', '-', '.', '<', '[', '{', '?':

    Comment by QFly — Friday, 26 December 2008 @ 20:46

  2. To fix this bug, replace the last item in the case statement on line 265 with #$00B7. I’ve updated the TPerlRegEx download with this change.

    I also wrote a blog post titled Choose The Right File Format for Your Delphi Source Code that explains why this error occurs on your system but not on mine.

    Comment by Jan Goyvaerts — Sunday, 28 December 2008 @ 12:41

  3. Thanks for your excellent work.
    And thanks for your answer.

    Sorry for my poor English. I’ll read your post carefully to find out what really happened.

    Comment by QFly — Monday, 29 December 2008 @ 20:20
