Regex Guru

Wednesday, 22 September 2010

Bug in Delphi XE RegularExpressions Unit

Filed under: Regex Trouble — Jan Goyvaerts @ 11:16

Using the new RegularExpressions unit in Delphi XE, you can iterate over all the matches that a regex finds in a string like this:

procedure TForm1.Button1Click(Sender: TObject);
var
  RegEx: TRegEx;
  Match: TMatch;
begin
  RegEx := TRegex.Create('\w+');
  Match := RegEx.Match('One two three four');
  while Match.Success do begin
    Memo1.Lines.Add(Match.Value);
    Match := Match.NextMatch;
  end
end;

Or you could save yourself two lines of code by using the static TRegEx.Match call:

procedure TForm1.Button2Click(Sender: TObject);
var
  Match: TMatch;
begin
  Match := TRegEx.Match('One two three four', '\w+');
  while Match.Success do begin
    Memo1.Lines.Add(Match.Value);
    Match := Match.NextMatch;
  end
end;

Unfortunately, due to a bug in the RegularExpressions unit, the static call doesn’t work. Depending on your exact code, you may get fewer matches than you should, or you may get blank matches, or your application may crash with an access violation.

The RegularExpressions unit defines TRegEx and TMatch as records. That way you don’t have to explicitly create and destroy them. Internally, TRegEx uses TPerlRegEx to do the heavy lifting. TPerlRegEx is a class that needs to be created and destroyed like any other class. If you look at the TRegEx source code, you’ll notice that it uses an interface to destroy the TPerlRegEx instance when TRegEx goes out of scope. Interfaces are reference counted in Delphi, making them usable for automatic memory management.

The bug is that TMatch and TGroupCollection also need the TPerlRegEx instance to do their work. TRegEx passes its TPerlRegEx instance to TMatch and TGroupCollection, but it does not pass the instance of the interface that is responsible for destroying TPerlRegEx.

This is not a problem in our first code sample. TRegEx stays in scope until we’re done with TMatch. The interface is destroyed when Button1Click exits.

In the second code sample, the static TRegEx.Match call creates a local variable of type TRegEx. This local variable goes out of scope when TRegEx.Match returns. Thus the reference count on the interface reaches zero and TPerlRegEx is destroyed when TRegEx.Match returns. When we call MatchAgain the TMatch record tries to use a TPerlRegEx instance that has already been destroyed.

To fix this bug, delete or rename the two RegularExpressions.dcu files and copy RegularExpressions.pas into your source code folder. Make these changes to both the TMatch and TGroupCollection records in this unit:

  1. Declare FNotifier: IInterface; in the private section.
  2. Add the parameter ANotifier: IInterface; to the Create constructor.
  3. Assign FNotifier := ANotifier; in the constructor’s implementation.

You also need to add the ANotifier: IInterface; parameter to the TMatchCollection.Create constructor.

Now try to compile some code that uses the RegularExpressions unit. The compiler will flag all calls to TMatch.Create, TGroupCollection.Create and TMatchCollection.Create. Fix them by adding the ANotifier or FNotifier parameter, depending on whether ARegEx or FRegEx is being passed.

With these fixes, the TPerlRegEx instance won’t be destroyed until the last TRegEx, TMatch, or TGroupCollection that uses it goes out of scope or is used with a different regular expression.

Wednesday, 15 September 2010

Regular Expressions Cookbook Ebook Deal of the Day

Filed under: Regex Cookbook — Jan Goyvaerts @ 9:20

Every day O’Reilly has an “ebook deal of the day” offering one or a bunch of their books in electronic format for only $9.99. Twice this year I received an email from O’Reilly notifying me that Regular Expressions Cookbook was on sale. But each time the email was sent on the morning of the day itself. When it’s morning in California it’s already bedtime for me here in Thailand. So I never saw the emails until the next day, making it rather pointless to blog about the deal.

But this time O’Reilly has listened to my request for advance notification. I just got an email this morning saying Regular Expressions Cookbook will be part of the Ebook Deal of the Day for 15 September 2010. That’s 15 September on the US west coast. When I write this there’s a few hours to go before the deal starts at one past midnight California time. You can get any O’Reilly Cookbook as an ebook for only $9.99. The normal price for Regular Expressons Cookbook as an ebook is $31.99. The download includes the book in PDF, ePub, Mobi (for Kindle), DAISY, and Android formats.

Tuesday, 7 September 2010

PerlRegEx vs RegularExpressionsCore Delphi Units

Filed under: Regex Libraries — Jan Goyvaerts @ 8:13

The RegularExpressionsCore unit that is part of Delphi XE is based on the latest class-based PerlRegEx unit that I developed. Embarcadero only made a few changes to the unit. These changes are insignificant enough that code written for earlier versions of Delphi using the class-based PerlRegEx unit will work just the same with Delphi XE.

  • The unit was renamed from PerlRegEx to RegularExpressionsCore. When migrating your code to Delphi XE, you can choose whether you want to use the new RegularExpressionsCore unit or continue using the PerlRegEx unit in your application. All you need to change is which unit you add to the uses clause in your own units.
  • Indentation and line breaks in the code were changed to match the style used in the Delphi RTL and VCL code. This does not change the code, but makes it harder to diff the two units.
  • Literal strings in the unit were separated into their own unit called RegularExpressionsConsts. These strings are only used for error messages that indicate bugs in your code. If your code uses TPerlRegEx correctly then the user should not see any of these strings.
  • My code uses assertions to check for out of bounds parameters, while Embarcadero uses exceptions. Again, if you use TPerlRegEx correctly, you should never get any assertions or exceptions. The Compile method raises an exception if the regular expression is invalid in both my original TPerlRegEx component and Embarcadero’s version. If your code allows the user to provide the regular expression, you should explicitly call Compile and catch any exceptions it raises so you can tell the user there is a problem with the regular expression. Even with user-provided regular expressions, you shouldn’t get any other assertions or exceptions if your code is correct.

Note that Embarcadero owns all the rights to their RegularExpressionsCore unit. Like all the other RTL and VCL units, this unit cannot be distributed by myself or anyone other than Embarcadero. I do retain the rights to my original PerlRegEx unit which I will continue to make available for those using older versions of Delphi.

Monday, 6 September 2010

New TPerlRegEx Compatible with Delphi XE

Filed under: Regex Libraries — Jan Goyvaerts @ 15:08

The new RegularExpressionsCore unit in Delphi XE is based on the PerlRegEx unit that I wrote many years ago. Since I donated full rights to a copy rather than full rights to the original, I can continue to make my version of TPerlRegEx available to people using older versions of Delphi. I did make a few changes to the code to modernize it a bit prior to donating a copy to Embarcadero. The latest TPerlRegEx includes those changes. This allows you to use the same regex-based code using the RegularExpressionsCore unit in Delphi XE, and the PerlRegEx unit in Delphi 2010 and earlier.

If you’re writing new code using regular expressions in Delphi 2010 or earlier, I strongly recomment you use the new version of my PerlRegEx unit. If you later migrate your code to Delphi XE, all you have to do is replace PerlRegEx with RegularExrpessionsCore in the uses clause of your units.

If you have code written using an older version of TPerlRegEx that you want to migrate to the latest TPerlRegEx, you’ll need to take a few changes into account.

The original TPerlRegEx was developed when Borland’s goal was to have a component for everything on the component palette. So the old TPerlRegEx derives from TComponent, allowing you to put it on the component palette and drop it on a form.

The new TPerlRegEx derives from TObject. It can only be instantiated at runtime. If you want to migrate from an older version of TPerlRegEx to the latest TPerlRegEx, start with removing any TPerlRegEx components you may have placed on forms or data modules and instantiate the objects at runtime instead. When instantiating at runtime, you no longer need to pass an owner component to the Create() constructor. Simply remove the parameter.

Some of the property and method names in the original TPerlRegEx were a bit unwieldy. These have been renamed in the latest TPerlRegEx. Essentially, in all identifiers SubExpression was replaced with Group and MatchedExpression was replaced with Matched. Here is a complete list of the changed identifiers:

Old Identifier New Identifier
StoreSubExpression StoreGroups
NamedSubExpression NamedGroup
MatchedExpression MatchedText
MatchedExpressionLength MatchedLength
MatchedExpressionOffset MatchedOffset
SubExpressionCount GroupCount
SubExpressions Groups
SubExpressionLengths GroupLengths
SubExpressionOffsets GroupOffsets

Download TPerlRegEx. Source is included under the MPL 1.1 license.