Regex Guru

Wednesday, 8 October 2008

R

Filed under: Regex Libraries — Jan Goyvaerts @ 17:09

R is the name of the programming language in the R project for statistical computing. It’s a bit different from the usual programming languages I work with. R primarily works on vectors, and so does its regular expression functionality. This means the new regular expression source code snippets for R in RegexBuddy 3.2.0 are a little different from those for the other languages RegexBuddy supports. But that’s a good thing. The regexp functionality in R works well with the rest of the language. The new article Regular Expressions with The R Language on regular-expressions.info has more details.
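To give a feel for what "working on vectors" means for regex functions, here is a minimal sketch in Python rather than R. The helper names `grepl_like` and `gsub_like` are hypothetical; they only mimic the general shape of R's `grepl` and `gsub`, which take a whole vector of strings and return a result per element. They are not a port of R's behavior.

```python
import re

def grepl_like(pattern, values):
    """Return one boolean per input string: does the pattern match it?"""
    compiled = re.compile(pattern)
    return [compiled.search(value) is not None for value in values]

def gsub_like(pattern, replacement, values):
    """Return one string per input string with all matches replaced."""
    compiled = re.compile(pattern)
    return [compiled.sub(replacement, value) for value in values]

# Hypothetical sample data: the whole list is processed in one call.
words = ["regex", "vector", "r2d2"]
has_digit = grepl_like(r"\d", words)
masked = gsub_like(r"\d", "#", words)
```

The point of the sketch is only that the caller never writes the loop: one call handles every element, which is how the R snippets differ from those for most other languages.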

Though the regex features are tailored for R, the regex flavors are not. R supports three flavors, called basic, extended and perl in R’s documentation. I call these flavors GNU BRE, GNU ERE and PCRE. RegexBuddy supports all three, but the R source code snippets only use PCRE. PCRE is far more powerful than GNU BRE and GNU ERE. These two older flavors, both based on the POSIX standard, offer the same functionality as each other, but BRE uses a different syntax than ERE and PCRE.

Tuesday, 7 October 2008

Windows PowerShell

Filed under: Regex Libraries — Jan Goyvaerts @ 16:57

Windows PowerShell is Microsoft’s shell scripting language, based on the .NET framework. I’ve added an article to regular-expressions.info explaining how to use regular expressions with Windows PowerShell. RegexBuddy 3.2.0 now supports PowerShell strings and PowerShell source code snippets.

Since PowerShell is built on top of the .NET framework, .NET’s excellent regular expression support is also available to PowerShell programmers. PowerShell has special -match and -replace operators. These allow you to do the most common pattern matching and replacement jobs with PowerShell’s typical CmdLet and operator syntax.

For other functionality, such as splitting a string or replacing with a MatchEvaluator, you need to instantiate System.Text.RegularExpressions.Regex. One constructor of the Regex class is available through the shortcut [regex]. This constructor takes one parameter: your regex as a string. With the Regex object, you can do anything with regular expressions in PowerShell that you can do in any other .NET language.
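For readers who haven't used a MatchEvaluator: it is a callback that receives each match and returns the replacement text. The sketch below shows the same two ideas, splitting and replacing through a callback, in Python rather than PowerShell. It is an analogy only and makes no claim about the .NET API. The function names are hypothetical.

```python
import re

def double_numbers(text):
    """Replace every run of digits with twice its numeric value.

    The lambda plays the role of a match evaluator: it is called once
    per match and returns the replacement string for that match.
    """
    return re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)

def split_on_commas(text):
    """Split on commas, swallowing any whitespace around each comma."""
    return re.split(r"\s*,\s*", text)
```

For example, `double_numbers("3 apples and 10 pears")` computes each replacement from the matched text, which a plain replacement string cannot do.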

Tuesday, 19 August 2008

TPerlRegEx for Delphi 2009

Filed under: Regex Libraries — Jan Goyvaerts @ 12:12

TPerlRegEx is a Delphi VCL component wrapper around the open source PCRE library. I originally developed it for in-house use. It powered EditPad Pro 4 and 5, PowerGREP 1 and 2, and RegexBuddy 1. The latest versions of these products use a custom-built regular expression engine. The custom-built engine can do things such as searching through files larger than 4 GB (in PowerGREP) or emulating many regex flavors (in RegexBuddy), which aren’t typical usage scenarios for regular expressions.

The PCRE library is a great choice to add regex support to your Delphi applications. Though the library is written in C, Delphi can link the OBJ files output by a C compiler into your application. TPerlRegEx includes ready-made OBJ files, so you don’t have to worry about any of this.

The latest version of TPerlRegEx includes PCRE 7.7, with Unicode support enabled. To actually use the Unicode features, you’ll need Delphi 2009. In PerlRegEx.pas, you’ll see the following near the top of the unit:

{$IFDEF UNICODE}
type
  PCREString = UTF8String;
{$ELSE}
type
  PCREString = AnsiString;
{$ENDIF}

The UNICODE directive is defined by default in Delphi 2009, but not in Delphi 2007 or earlier. If you’re using Delphi 2007 or before, TPerlRegEx will work with AnsiString, which has been the default string type since Delphi 2.

When you migrate your application to Delphi 2009, the default string type becomes UnicodeString. PCRE does not support UTF-16. It only supports 8-bit strings (i.e. one byte per character), and UTF-8. Hence my decision to make TPerlRegEx use the new and improved UTF8String in Delphi 2009.

When you assign a variable declared as “string”, which really is UnicodeString in Delphi 2009, to a property such as TPerlRegEx.Subject, then the Delphi 2009 compiler will automatically do the UTF-16 to UTF-8 conversion for you. When you assign a property such as TPerlRegEx.MatchedExpression to a string variable, the UTF-8 to UTF-16 conversion is also automatic. The net result is that when you use TPerlRegEx and you upgrade from Delphi 2007 to Delphi 2009, your regular expressions automatically become Unicode-enabled.

The only caveat lies in position properties such as MatchedExpressionOffset and MatchedExpressionLength. These indicate byte positions in the UTF-8 strings that TPerlRegEx deals with. To make your code work correctly in Delphi 2009, use those positions with the TPerlRegEx.Subject property (which uses UTF-8) instead of your original string variable (which uses UTF-16 in Delphi 2009).
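To see why the positions differ, consider how a byte span in a UTF-8 string maps to a position in the UTF-16 version of the same text. The sketch below is in Python rather than Delphi and is only an illustration of the arithmetic. The function name is hypothetical and the offsets are 0-based, whereas Delphi strings are indexed from 1, so adjust accordingly.

```python
def utf8_span_to_utf16(subject, byte_offset, byte_length):
    """Map a 0-based UTF-8 byte span to a 0-based UTF-16 code-unit span.

    Assumes the span starts and ends on character boundaries, as the
    span of a regex match on valid UTF-8 text would.
    """
    utf8 = subject.encode("utf-8")
    before = utf8[:byte_offset].decode("utf-8")
    matched = utf8[byte_offset:byte_offset + byte_length].decode("utf-8")
    # Each UTF-16 code unit is 2 bytes, so byte count // 2 = code units.
    utf16_offset = len(before.encode("utf-16-le")) // 2
    utf16_length = len(matched.encode("utf-16-le")) // 2
    return utf16_offset, utf16_length
```

In "naïve café", the word "café" starts at byte 7 and is 5 bytes long in UTF-8, but starts at code unit 6 and is 4 code units long in UTF-16. Any non-ASCII character before or inside the match shifts the numbers, which is why the byte positions are only meaningful against the UTF-8 subject.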

Note that in Delphi 2007, there’s no difference between UTF8String and AnsiString. Manually defining the UNICODE directive in Delphi 2007 will not make TPerlRegEx support Unicode. You’d have to add explicit calls to UTF8Encode and UTF8Decode to do the conversions.

All in all, porting TPerlRegEx from Delphi 2007 to 2009 was very easy. Changing the string declaration from AnsiString to UTF8String and passing the PCRE_UTF8 flag to the pcre_compile function were all it really took. I wasted much more time upgrading the OBJ files from PCRE 4.5 to 7.7. In the end I borrowed the 7.7 OBJ files from the JCL project.

Download TPerlRegEx. Source is included under the MPL 1.1 license.

Wednesday, 23 April 2008

PCRE Library for MySQL

Filed under: Regex Libraries — Jan Goyvaerts @ 11:24

A RegexBuddy user pointed me to LIB_MYSQLUDF_PREG. This is an open source library of MySQL user functions that imports the PCRE library.

MySQL’s built-in regular expression support uses the POSIX ERE flavor. By today’s standards, that flavor offers limited regex functionality. PCRE on the other hand offers all the goodies from Perl and other modern regex flavors.

If you want to work with LIB_MYSQLUDF_PREG in RegexBuddy, you’ll need to set the regex flavor to PCRE. Use the “PHP preg operator” string style when copying and pasting regular expressions. This will format your regex as '/regex/' as required by LIB_MYSQLUDF_PREG.
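If you build such strings by hand, the '/regex/' format involves two layers of quoting: the forward-slash delimiters around the regex, and the single-quoted SQL string around that. The Python sketch below illustrates both layers. The function names are hypothetical, and the SQL step assumes MySQL's default mode, in which the backslash is an escape character inside string literals. It is not a description of what RegexBuddy does internally.

```python
def to_preg_literal(pattern, delimiter="/"):
    """Wrap a regex in delimiters, escaping unescaped delimiter characters."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Keep an existing escape sequence (backslash + next char) intact.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        out.append("\\" + ch if ch == delimiter else ch)
        i += 1
    return delimiter + "".join(out) + delimiter

def to_mysql_string(text):
    """Quote text as a MySQL string literal (assumes backslash escapes are on)."""
    return "'" + text.replace("\\", "\\\\").replace("'", "''") + "'"
```

For instance, the regex `\d+/\d+` becomes `/\d+\/\d+/` after the first step. The second step then doubles every backslash so that MySQL passes the regex through unchanged.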

I haven’t tried to use LIB_MYSQLUDF_PREG myself. I don’t have access to a MySQL server where I can install such libraries.

If you want RegexBuddy to generate source code snippets for LIB_MYSQLUDF_PREG, you can edit the provided MySQL template. Change the regex flavor to PCRE and the string style to PHP/preg. Then edit the functions to use the PREG_* calls instead of MySQL’s built-in operators. Save your custom template, and share it on the RegexBuddy user forum. :-)
