TPerlRegEx is a Delphi VCL component wrapper around the open source PCRE library. I originally developed it for in-house use. It powered EditPad Pro 4 and 5, PowerGREP 1 and 2, and RegexBuddy 1. The latest versions of these products use a custom-built regular expression engine. The custom-built engine can do things such as searching through files larger than 4 GB (in PowerGREP) or emulating many regex flavors (in RegexBuddy), which aren’t typical usage scenarios for regular expressions.
The PCRE library is a great choice to add regex support to your Delphi applications. Though the library is written in C, Delphi can link the OBJ files output by a C compiler into your application. TPerlRegEx includes ready-made OBJ files, so you don’t have to worry about any of this.
The latest version of TPerlRegEx includes PCRE 7.7, with Unicode support enabled. To actually use the Unicode features, you’ll need Delphi 2009. In PerlRegEx.pas, you’ll see the following near the top of the unit:
{$IFDEF UNICODE}
type
  PCREString = UTF8String;
{$ELSE}
type
  PCREString = AnsiString;
{$ENDIF}
The UNICODE directive is defined by default in Delphi 2009, but not in Delphi 2007 or earlier. If you’re using Delphi 2007 or before, TPerlRegEx will work with AnsiString, which has been the default string type since Delphi 2.
When you migrate your application to Delphi 2009, the default string type becomes UnicodeString. PCRE does not support UTF-16. It only supports 8-bit strings (i.e. one byte per character), and UTF-8. Hence my decision to make TPerlRegEx use the new and improved UTF8String in Delphi 2009.
When you assign a variable declared as “string”, which really is UnicodeString in Delphi 2009, to a property such as TPerlRegEx.Subject, then the Delphi 2009 compiler will automatically do the UTF-16 to UTF-8 conversion for you. When you assign a property such as TPerlRegEx.MatchedExpression to a string variable, the UTF-8 to UTF-16 conversion is also automatic. The net result is that when you use TPerlRegEx and you upgrade from Delphi 2007 to Delphi 2009, your regular expressions automatically become Unicode-enabled.
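The automatic conversions described above might look like this in practice. This is a minimal sketch for Delphi 2009, not code taken from the component. Subject and MatchedExpression are the properties named above; the RegEx property, the Match method, and the Create(nil) constructor are assumptions about the component's interface, so check them against your copy of PerlRegEx.pas.

```pascal
program AutoConvertDemo;

{$APPTYPE CONSOLE}

uses
  PerlRegEx;  // the unit that ships with TPerlRegEx

var
  Regex: TPerlRegEx;
  Input, Found: string;  // "string" is UnicodeString (UTF-16) in Delphi 2009
begin
  // "Total: 25 EUR-sign", written with a character code so the source stays ASCII
  Input := 'Total: 25 ' + #$20AC;

  Regex := TPerlRegEx.Create(nil);  // assumption: component-style constructor
  try
    // Assumption: the pattern is set through a RegEx property.
    // The compiler converts the UTF-16 constant to UTF-8 on assignment.
    Regex.RegEx := '\d+ ' + #$20AC;

    // UTF-16 to UTF-8 conversion happens implicitly here
    Regex.Subject := Input;

    if Regex.Match then  // assumption: Match returns True when a match is found
    begin
      // UTF-8 to UTF-16 conversion happens implicitly here
      Found := Regex.MatchedExpression;
      WriteLn(Length(Found));  // length in UTF-16 code units, not bytes
    end;
  finally
    Regex.Free;
  end;
end.
```

No explicit conversion call appears anywhere; the assignments between string and the UTF8String-typed properties are what trigger it.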
The only caveat lies in position properties such as MatchedExpressionOffset and MatchedExpressionLength. These indicate byte positions in the UTF-8 strings that TPerlRegEx deals with. To make your code work correctly in Delphi 2009, use those positions with the TPerlRegEx.Subject property (which uses UTF-8) instead of your original string variable (which uses UTF-16 in Delphi 2009).
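To see why the offsets must be applied to the UTF-8 string, consider this small sketch, which needs Delphi 2009 but not TPerlRegEx itself. The values in ByteOffset and ByteLength are hypothetical stand-ins for what MatchedExpressionOffset and MatchedExpressionLength might report for a match on "42", assuming the offset is a 1-based index of the kind Delphi's Copy expects.

```pascal
program Utf8OffsetDemo;

{$APPTYPE CONSOLE}

var
  Original: string;      // UTF-16 in Delphi 2009
  Subject: UTF8String;   // stands in for what TPerlRegEx.Subject would hold
  ByteOffset, ByteLength: Integer;
  FromSubject, FromOriginal: string;
begin
  // "cafe 42" with an accented e (U+00E9):
  // one UTF-16 code unit, but two bytes in UTF-8
  Original := 'caf' + #$00E9 + ' 42';
  Subject := UTF8String(Original);

  // Hypothetical values: "42" starts at byte 7 of the UTF-8 string
  ByteOffset := 7;
  ByteLength := 2;

  // Correct: byte positions applied to the UTF-8 string
  FromSubject := string(Copy(Subject, ByteOffset, ByteLength));
  WriteLn(FromSubject);   // 42

  // Wrong: the same positions applied to the UTF-16 string,
  // where "42" starts at index 6
  FromOriginal := Copy(Original, ByteOffset, ByteLength);
  WriteLn(FromOriginal);  // 2
end.
```

Every non-ASCII character before the match shifts the UTF-8 positions further away from the UTF-16 ones, so the mismatch only shows up with non-ASCII input.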
Note that in Delphi 2007, there’s no difference between UTF8String and AnsiString. Manually defining the UNICODE directive in Delphi 2007 will not make TPerlRegEx support Unicode. You’d have to add explicit calls to UTF8Encode and UTF8Decode to do the conversions.
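For Delphi 2007, the explicit conversions could be sketched as follows. FirstMatch is a hypothetical helper, and the RegEx property, Match method, and Create(nil) constructor are again assumptions about the component's interface. UTF8Encode and UTF8Decode are the standard Delphi routines for converting between WideString and UTF8String. Whether PCRE then treats the bytes as UTF-8 rather than as single-byte characters depends on how the unit was compiled, so this only illustrates where the calls go.

```pascal
program ExplicitUtf8Demo;

{$APPTYPE CONSOLE}

uses
  PerlRegEx;  // the unit that ships with TPerlRegEx

// Hypothetical helper: returns the first match of Pattern in Text,
// or an empty string when there is no match.
function FirstMatch(const Pattern, Text: WideString): WideString;
var
  Regex: TPerlRegEx;
begin
  Result := '';
  Regex := TPerlRegEx.Create(nil);  // assumption: component-style constructor
  try
    // Delphi 2007 does no automatic conversion, so encode by hand
    Regex.RegEx := UTF8Encode(Pattern);   // assumption: RegEx property
    Regex.Subject := UTF8Encode(Text);
    if Regex.Match then                   // assumption: Match returns Boolean
      // ...and decode the UTF-8 result back to UTF-16 by hand
      Result := UTF8Decode(Regex.MatchedExpression);
  finally
    Regex.Free;
  end;
end;

begin
  WriteLn(Length(FirstMatch('\d+', 'abc 123')));
end.
```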
All in all, porting TPerlRegEx from Delphi 2007 to 2009 was very easy. Changing the string declaration from AnsiString to UTF8String, and passing the PCRE_UTF8 flag to the pcre_compile function is all it really took. I spent much more time upgrading the OBJ files from PCRE 4.5 to 7.7; in the end I borrowed the new OBJ files from the JCL project.
Download TPerlRegEx. Source is included under the MPL 1.1 license.