Regex Guru

Monday, 3 October 2011

Expressões Regulares Cookbook

Filed under: Regex Cookbook — Jan Goyvaerts @ 8:25

I received my author copy of “Expressões Regulares Cookbook” last week. This is the Brazilian Portuguese translation of Regular Expressions Cookbook. You can buy Expressões Regulares Cookbook from the publisher Novatec or any bookstore that sells Brazilian Portuguese language books. Ask for ISBN 978-85-7522-279-9.

Friday, 30 September 2011

한 권으로 끝내는 정규표현식

Filed under: Regex Cookbook — Jan Goyvaerts @ 8:25

I received my author copy of “한 권으로 끝내는 정규표현식” last month. This is the Korean translation of Regular Expressions Cookbook. You can buy 한 권으로 끝내는 정규표현식 from the publisher Hanbit Media, Inc. or any bookstore that sells Korean language books. Ask for ISBN 978-89-7914-774-2.

Monday, 20 December 2010

#1 O’Reilly eBook for 2010

Filed under: Regex Cookbook — Jan Goyvaerts @ 10:21

The year-end issue of O’Reilly’s author newsletter discussed the trends O’Reilly has been seeing over the past few years, along with their predictions for 2011. The key trend is that digital is now, more than ever, poised to take over from print:

Our digitally distributed products have grown from 18.36% of our publishing mix in 2009 to 28.09% of our mix in 2010. What is more impressive is that our digitally distributed products have produced more than double the revenue that has been lost with the decline of print. I think this is important because some say that digital cannibalizes print products. Our data indicates the contrary, as print is declining much more slowly than digital is growing. I think we may be seeing developers purchasing a print book, and then purchasing the electronic editions to search and copying code from, as the incremental cost for digital is more than reasonable.

My own book seems to be leading this trend. Thanks to everyone who purchased it!

And the five bestselling O’Reilly ebook products for 2010: 1) Regular Expressions Cookbook, 2) jQuery Cookbook, 3) Learning Python, 4) HTML5: Up and Running, and 5) JavaScript Cookbook. I think it’s interesting that the top five ebooks are code-intensive books. They’re great products for search and code reuse.

It’s also interesting that none of the top 5 ebooks made the top 5 of print books.

Wednesday, 22 September 2010

Bug in Delphi XE RegularExpressions Unit

Filed under: Regex Trouble — Jan Goyvaerts @ 11:16

Using the new RegularExpressions unit in Delphi XE, you can iterate over all the matches that a regex finds in a string like this:

procedure TForm1.Button1Click(Sender: TObject);
var
  RegEx: TRegEx;
  Match: TMatch;
begin
  RegEx := TRegEx.Create('\w+');
  Match := RegEx.Match('One two three four');
  while Match.Success do begin
    Memo1.Lines.Add(Match.Value);
    Match := Match.NextMatch;
  end
end;

Or you could save yourself two lines of code by using the static TRegEx.Match call:

procedure TForm1.Button2Click(Sender: TObject);
var
  Match: TMatch;
begin
  Match := TRegEx.Match('One two three four', '\w+');
  while Match.Success do begin
    Memo1.Lines.Add(Match.Value);
    Match := Match.NextMatch;
  end
end;

Unfortunately, due to a bug in the RegularExpressions unit, the static call doesn’t work. Depending on your exact code, you may get fewer matches than you should, or you may get blank matches, or your application may crash with an access violation.

The RegularExpressions unit defines TRegEx and TMatch as records. That way you don’t have to explicitly create and destroy them. Internally, TRegEx uses TPerlRegEx to do the heavy lifting. TPerlRegEx is a class that needs to be created and destroyed like any other class. If you look at the TRegEx source code, you’ll notice that it uses an interface to destroy the TPerlRegEx instance when TRegEx goes out of scope. Interfaces are reference counted in Delphi, making them usable for automatic memory management.
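The pattern described above can be sketched in a few lines. This is only an illustration of the general technique, not the actual source of the RegularExpressions unit: TEngine, TEngineGuard and TWrapper are hypothetical stand-ins for TPerlRegEx, the interfaced helper, and TRegEx.

```delphi
program NotifierSketch;

{$APPTYPE CONSOLE}

type
  // Hypothetical stand-in for TPerlRegEx: a class that must be freed.
  TEngine = class
  public
    destructor Destroy; override;
  end;

  // Hypothetical helper: frees the engine when the last interface
  // reference to the guard is released.
  TEngineGuard = class(TInterfacedObject)
  private
    FEngine: TEngine;
  public
    constructor Create(AEngine: TEngine);
    destructor Destroy; override;
  end;

  // Hypothetical stand-in for TRegEx: a record, so nobody calls Free on it.
  TWrapper = record
  private
    FEngine: TEngine;
    FGuard: IInterface; // reference counted; released with the record
  public
    constructor Create(const APattern: string);
  end;

destructor TEngine.Destroy;
begin
  Writeln('Engine destroyed');
  inherited;
end;

constructor TEngineGuard.Create(AEngine: TEngine);
begin
  inherited Create;
  FEngine := AEngine;
end;

destructor TEngineGuard.Destroy;
begin
  FEngine.Free;
  inherited;
end;

constructor TWrapper.Create(const APattern: string);
begin
  // APattern is unused in this sketch; a real wrapper would compile it.
  FEngine := TEngine.Create;
  FGuard := TEngineGuard.Create(FEngine);
end;

procedure UseWrapper;
var
  W: TWrapper;
begin
  W := TWrapper.Create('\w+');
  Writeln('Wrapper in use');
end; // W leaves scope here, so the guard is released and frees the engine

begin
  UseWrapper;
  Writeln('Back in main');
end.
```

The engine should be freed as UseWrapper returns, without any explicit call to Free, because the record's interface field is finalized along with the record.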

The bug is that TMatch and TGroupCollection also need the TPerlRegEx instance to do their work. TRegEx passes its TPerlRegEx instance to TMatch and TGroupCollection, but it does not pass the instance of the interface that is responsible for destroying TPerlRegEx.

This is not a problem in our first code sample. TRegEx stays in scope until we’re done with TMatch. The interface is destroyed when Button1Click exits.

In the second code sample, the static TRegEx.Match call creates a local variable of type TRegEx. This local variable goes out of scope when TRegEx.Match returns. Thus the reference count on the interface reaches zero and TPerlRegEx is destroyed when TRegEx.Match returns. When we then call Match.NextMatch, the TMatch record tries to use a TPerlRegEx instance that has already been destroyed.

To fix this bug, delete or rename the two RegularExpressions.dcu files and copy RegularExpressions.pas into your source code folder. Make these changes to both the TMatch and TGroupCollection records in this unit:

  1. Declare FNotifier: IInterface; in the private section.
  2. Add the parameter ANotifier: IInterface; to the Create constructor.
  3. Assign FNotifier := ANotifier; in the constructor’s implementation.

You also need to add the ANotifier: IInterface; parameter to the TMatchCollection.Create constructor.

Now try to compile some code that uses the RegularExpressions unit. The compiler will flag all calls to TMatch.Create, TGroupCollection.Create and TMatchCollection.Create. Fix them by adding the ANotifier or FNotifier parameter, depending on whether ARegEx or FRegEx is being passed.

With these fixes, the TPerlRegEx instance won’t be destroyed until the last TRegEx, TMatch, or TGroupCollection that uses it goes out of scope or is used with a different regular expression.
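To show why passing the interface along solves the problem, here is a minimal sketch of the fix in miniature. It does not reproduce the real declarations in RegularExpressions.pas: TEngine, TEngineGuard, TResultRec and StaticMatch are hypothetical names standing in for TPerlRegEx, the interfaced helper, TMatch, and the static TRegEx.Match call. Only FNotifier and ANotifier are taken from the steps above.

```delphi
program NotifierFixSketch;

{$APPTYPE CONSOLE}

type
  // Hypothetical stand-in for TPerlRegEx.
  TEngine = class
  public
    destructor Destroy; override;
  end;

  // Hypothetical guard that frees the engine when its count reaches zero.
  TEngineGuard = class(TInterfacedObject)
  private
    FEngine: TEngine;
  public
    constructor Create(AEngine: TEngine);
    destructor Destroy; override;
  end;

  // Hypothetical stand-in for TMatch with the three changes applied.
  TResultRec = record
  private
    FEngine: TEngine;
    FNotifier: IInterface; // step 1: keeps the engine alive
  public
    constructor Create(AEngine: TEngine; ANotifier: IInterface); // step 2
  end;

var
  EngineAlive: Boolean = False; // lets us observe the engine's lifetime

destructor TEngine.Destroy;
begin
  EngineAlive := False;
  inherited;
end;

constructor TEngineGuard.Create(AEngine: TEngine);
begin
  inherited Create;
  FEngine := AEngine;
end;

destructor TEngineGuard.Destroy;
begin
  FEngine.Free;
  inherited;
end;

constructor TResultRec.Create(AEngine: TEngine; ANotifier: IInterface);
begin
  FEngine := AEngine;
  FNotifier := ANotifier; // step 3
end;

// Analogue of the static call: the local guard reference is released
// on return, but the result record still holds its own reference.
function StaticMatch: TResultRec;
var
  Engine: TEngine;
  Guard: IInterface;
begin
  Engine := TEngine.Create;
  EngineAlive := True;
  Guard := TEngineGuard.Create(Engine);
  Result := TResultRec.Create(Engine, Guard);
end;

procedure Demo;
var
  R: TResultRec;
begin
  R := StaticMatch;
  Writeln('Engine alive after StaticMatch returned: ', EngineAlive);
end;

begin
  Demo;
  Writeln('Engine alive after Demo returned: ', EngineAlive);
end.
```

With the FNotifier field in place, the first line should report TRUE and the second FALSE. If you remove the FNotifier assignment, the engine would be freed as soon as StaticMatch returns, which is the situation the unpatched unit ends up in.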
