Regex Guru

Wednesday, 27 May 2009

Regular Expressions Cookbook Has Been Published

Filed under: Regex Cookbook — Jan Goyvaerts @ 11:29

I just heard from my editor at O’Reilly that Regular Expressions Cookbook, written by Steven Levithan and me, is now shipping. As I write this, Amazon.com still lists it as a pre-order with a June 4 release date. I’m sure they’ll fulfill all pre-orders by that date, so don’t wait to buy your own copy! If you use one of the links below to order, Amazon will pay me an affiliate commission, which will pretty much double the slice I get from what you pay for the book. I’ve listed the names of the countries to which Amazon offers free shipping. The list price of the book is US$ 44.99. The various Amazon sites offer different discounts, up to 34%, with free shipping to various countries.

The book covers the regular expression flavors .NET, Java, JavaScript, Perl, PCRE, Python, and Ruby, and the programming languages C#, Java, JavaScript, Perl, PHP, Python, Ruby, and VB.NET. After a quick introduction, the book starts with a detailed regular expression tutorial which equally covers all 7 regex flavors. That chapter is followed by a detailed guide on how to implement regular expressions in your source code, again covering the 8 programming languages equally. These chapters too are presented in cookbook format. You can easily pick out the task you want to accomplish when creating a regular expression of your own, and when you want to do something with a regex in your source code. While there’s some repetition, particularly in the programming guide, because of our goal of equal coverage, the benefit is that you can easily skip the parts on programming languages you’re not interested in, in true cookbook style.

The remaining chapters, over half of the book, present real-world problems, and how to solve them with regular expressions. These problems range from very simple problems and everyday regex tasks, to some complex problems that stretch the limits of what you can do with regular expressions, but show how a regex-based solution is often much quicker than doing the same in procedural code, particularly if you only need to do the job once. All the real-world problems also have solutions for all regex flavors. A few solutions add procedural code to make up for missing regex features, such as JavaScript lacking lookbehind. The book does not cover regex flavors with limited features such as the venerable POSIX standard. We didn’t want to put those flavors on the cover and then disappoint readers by saying “can’t be done with this limited flavor” for half of the recipes in the book.

If you use regular expressions with the latest versions of PowerGREP, EditPad Pro or AceText, you can use pretty much any of the regular expressions presented in the book. These products use a custom regex flavor that is a fusion of the features found in the flavors covered in Regular Expressions Cookbook. RegexBuddy 3 emulates all the flavors in the book. Older versions of these products are based on PCRE 4, without Unicode support.

Regular Expressions Cookbook targets people with regular expression skills ranging from zero to upper intermediate, who want to learn about regular expressions for the first time, or sharpen their skills to become experts. Except for the chapter on programming languages, most of the recipes in the book don’t require programming skills to implement the solutions in EditPad Pro, PowerGREP, or any other text editor or search tool that uses one of the book’s regular expression flavors. The programming chapter assumes you’re familiar with all the basic features and syntax of your programming language, but it doesn’t assume you’ve ever used regular expressions in your source code.

While the jury has only just received their copies of the book, I dare say that Regular Expressions Cookbook is the most practical book on regular expressions to date, filled with lots of detailed information about flavor-specific and language-specific features or issues glossed over by many other books and online articles.

Monday, 27 April 2009

Split() is Not Always The Best Way to Split a String

Filed under: Regex Trouble — Jan Goyvaerts @ 16:17

Most programming languages have a split() function that takes a regular expression and a string. It returns an array with the parts of the string between the regex matches. The regex matches themselves are discarded.

The split() function is great when it’s easy to write a regular expression to match the delimiters. You can easily split a string along commas: split(/,/, subject). In fact, you don’t even need a regex for this. Splitting on an (X)HTML break tag does require a regex, but is still very easy: <br\s*/?>.
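As a minimal sketch of these two easy cases, here is how they might look in Python. The function names are made up for illustration, and the break-tag pattern is written as `<br\s*/?>` so that `<br>`, `<br/>` and `<br />` are all treated as delimiters.

```python
import re

def split_on_commas(subject):
    # A fixed delimiter needs no regex at all; plain str.split() is enough.
    return subject.split(",")

def split_on_breaks(subject):
    # The delimiter varies (<br>, <br/>, <br />), so a regex is needed.
    return re.split(r"<br\s*/?>", subject)

print(split_on_commas("a,b,c"))
print(split_on_breaks("one<br>two<br />three<br/>four"))
```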

The split() function is terrible when the delimiters can occur in the split content. A common job is to split a string along commas, except when those commas appear in double quotes. Such a string might be a line in a CSV file. Every now and then on the RegexBuddy user forum somebody asks for help with fixing some hideous contraption of lookahead and lookbehind that is intended to match an unquoted comma.

In such cases, it is much easier to write a regex that matches the content you want to keep in the array, and use findall() instead of split(). Instead of writing a regex that matches what you want split() to throw away, you write a regex that matches what findall() should keep. While writing a regex to match unquoted commas is hard, writing one for CSV fields is quite straightforward:

"[^"]*"|[^,]+

This regex matches a pair of double quotes with anything except double quotes between them, or a series of characters that don’t include a comma. Those are the two forms that fields can use in CSV files. If you want to allow escaped quotes in quoted fields, this regex is a bit longer but very efficient:

"[^"\\]*(?:\\.[^"\\]*)*"|[^,]+

Both these regular expressions assume that a field that begins with a double quote also ends with one.
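To make the contrast concrete, here is a rough Python sketch that compares a naive split on commas with findall() using the two field regexes, written with straight double quotes. The names SIMPLE_FIELD, ESCAPED_FIELD and csv_fields are invented for this example.

```python
import re

# Quoted field without escapes, or a run of non-comma characters.
SIMPLE_FIELD = re.compile(r'"[^"]*"|[^,]+')
# Same idea, but a quoted field may contain backslash-escaped characters.
ESCAPED_FIELD = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"|[^,]+')

def csv_fields(line, pattern=SIMPLE_FIELD):
    # findall() returns whole matches here because the patterns
    # contain no capturing groups.
    return pattern.findall(line)

line = 'one,"two, three",four'
print(line.split(","))   # the quoted field is cut in two
print(csv_fields(line))  # the quoted field stays in one piece

escaped = r'"say \"hi\", ok",next'
print(csv_fields(escaped, ESCAPED_FIELD))
```

The quotes remain part of each matched field, so they still have to be stripped afterwards if you only want the content. Because both alternatives need at least one character, an empty field such as the one in a,,b is skipped rather than returned as an empty string.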

Another question I got recently was how to split text on semicolons, except for semicolons that are part of an HTML entity such as &amp;, using a regex flavor that does not support lookbehind. Again, the easiest way to solve this problem is to turn it on its head. Instead of figuring out a way to exclude semicolons that are part of an entity without using lookbehind, just write a regex that matches text with entities:

[^&;]+(?:&\w+;[^&;]*)*

This regular expression assumes that ampersands only occur in the text as part of entities. It matches any text excluding ampersands and semicolons, optionally followed by one or more entities, each followed by more text without ampersands and semicolons.
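A small Python sketch of this approach follows. Python is used only for illustration, since it does support lookbehind. The function name split_keeping_entities is made up. The pattern carries a trailing `*` on the group, so each part may contain zero or more entities.

```python
import re

# Text without & or ;, followed by any number of "entity plus more text" runs.
PART = re.compile(r"[^&;]+(?:&\w+;[^&;]*)*")

def split_keeping_entities(subject):
    # Collect the parts to keep, instead of matching the semicolons to discard.
    return PART.findall(subject)

print(split_keeping_entities("Tom &amp; Jerry;Fish &amp; Chips;plain"))
```

Because the pattern starts with `[^&;]+`, a part that begins directly with an entity is not matched as a whole by this sketch. Making the leading text optional would be one way to handle that case.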

Thursday, 12 March 2009

Regular Expressions Ratatouille

Filed under: Regex Cookbook — Jan Goyvaerts @ 16:13

The Amazon.com page for Regular Expressions Cookbook now shows the book’s cover. That cover is actually a draft. The final cover will be (slightly) different.

Our editor sent the cover to Steve and me for comment, and wrote:

I’m concerned mostly that neither of you have a hatred of rats.

I replied:

The rat in Ratatouille is pretty cool.

I guess the cover was designed during the year of the rat, which was last year, when we wrote the book. This year is the year of the ox.

In Thai, the word for mouse or rat is commonly used as a first person pronoun by children referring to themselves when addressing their elders, and as a 2nd or 3rd person pronoun by elders referring to youngsters.

So, no, I don’t have any cultural issues with rats.

Steve didn’t have any issues with the rat either. So if you buy our book, your regular expressions will be free of bugs, but cooked up by a rat pulling your hair! :-)

Monday, 2 February 2009

From Regex Newbie to Regex Guru

Filed under: About Regex Guru — Jan Goyvaerts @ 11:28

One of my last tasks for the Regular Expressions Cookbook was to write the preface, including my author bio. I told the story of how I went from my first real encounter with regular expressions in 2000, to the expert I am almost a decade later.

My first attempt at writing the bio came out way too long compared with the other sections in the preface. The final bio is only half as long. Rather than let the long bio go to waste, I’m publishing it here, with added links.

In 1996, fresh out of high school, Jan Goyvaerts started a hobby project publishing his own software on his own website. Less than a year earlier, Internet access had become available at local call rates in his Belgian hometown. In 1999, he decided that a university degree was only a ticket to joining the rat race, and focused on his ever more successful software development venture. He set up the business that would eventually become Just Great Software in 2000.

At that time, Jan had no idea he would ever become an expert on regular expressions. One of his early successes was a postcardware text editor called EditPad. Since postcards don’t pay the bills, he developed a commercial text editor called EditPad Pro. EditPad Pro, released mid-2000, needed regular expression support to compete. Only the best regex engine would do for “Just Great Software”. Jan decided to go with PCRE.

The regular expression features in EditPad Pro proved quite popular, particularly because PCRE offered a regex syntax compatible with Perl, which was all the rage. Most other text editors, even the big IDEs from Microsoft and Borland, had much simpler regular expression support. (In 2009, Visual Studio and Delphi still use those same old regex flavors. This frustrates Jan, because old and limited regex flavors don’t make good book material, and both IDEs are built on .NET, which provides a very rich regex flavor fully covered in this book.)

Sensing a need for more powerful tools for working with regular expressions on text files, Jan developed PowerGREP. PowerGREP got off to a slow start in late 2002. Today, it is clearly the most powerful tool on the Microsoft Windows platform for doing anything with regular expressions. One of the differentiating features early on was the inclusion of a detailed regular expression tutorial in the help file. Most other grep tools had only a single help topic listing all the syntax features, which you could print on a single sheet of paper. PowerGREP had a separate detailed help topic for every feature in PCRE.

Jan didn’t have a big budget to advertise his software. With many internet marketers preaching that content is king on the search engines, Jan set up http://www.regular-expressions.info with the text from the tutorial he had already written for PowerGREP. As he watched the site’s traffic and Google rank rise, ultimately beating the Wikipedia entry at the top, Jan started getting the idea that maybe this could become his area of expertise. Writing his own regular expression engine was still a scary thought.

Regular expressions hit the mainstream development community when .NET was released including a set of powerful regex classes. Not much later the Java platform added the same with the JDK 1.4 release. Seeing lots of Windows developers using regular expressions, and a customer base of EditPad Pro and PowerGREP users needing to test their regular expressions, Jan felt there was a need for a comprehensive tool to create, test, and edit regular expressions. RegexBuddy was released in 2004.

The PCRE engine, which had been such a blessing in 2000, was now seriously limiting both PowerGREP and RegexBuddy. PowerGREP needed to search files larger than 2 GB, and RegexBuddy needed to be compatible with all major regex flavors, not just PCRE. Jan bit the bullet and sweated for several months implementing a brand new regular expression engine. The result was a fusion regex flavor that supports almost all the features found in all the regex flavors discussed in this book, and that was fast and flexible enough to meet the needs of PowerGREP’s customers. The new regex engine made the 2005 releases of PowerGREP and RegexBuddy very successful.

By this time, Jan had become very aware of the differences between all the regular expression flavors. While RegexBuddy could now emulate nearly all the abilities of the popular regular expression flavors, it could not emulate their deficiencies. After much research and testing, Jan released RegexBuddy 3 in 2007, which can emulate the features, and lack thereof, of 15 different regular expression flavors.

Having spent so much time researching regular expressions, Jan felt he was ready to write the book on regular expressions. But he didn’t actually set out to do it. It was Steven Levithan, a very enthusiastic RegexBuddy user, who asked him in early 2008 if he wanted to co-write a book on regular expressions. Jan hesitated at first, books being much less profitable than software. After some reflection, he decided he would realize his childhood dream of seeing his name in print, before the printed book becomes obsolete.

The result will be published in May 2009. Enjoy.

Meanwhile, Jan has left cloudy Belgium for tropical Thailand. He now lives with his wife in Phuket, where he enjoys pretending to be a tourist, even though in reality he still spends far too much time flipping the switches on his DataHand.
