Regex Guru

Friday, 19 December 2008

Don’t Escape Literal Characters That Aren’t Metacharacters

Filed under: Regex Trouble — Jan Goyvaerts @ 17:29

Perl-style regular expressions treat 12 punctuation characters as metacharacters outside character classes. These characters need to be escaped with a backslash if you want to include them as literal characters in your regex:

.^$|*+?()[{\
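As a minimal sketch of that rule, the helper below escapes only those 12 characters and leaves all other punctuation alone. It uses Python's re module as a stand-in for a Perl-style flavor, which is an assumption made for illustration. The name escape_literal is hypothetical and not part of any library.

```python
import re

# The 12 metacharacters listed above (outside character classes).
METACHARACTERS = frozenset(".^$|*+?()[{\\")

def escape_literal(text):
    """Backslash-escape only the characters that need it (hypothetical helper)."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)

print(escape_literal("&lt;"))      # nothing in this string needs a backslash
print(escape_literal("1+1=2?"))    # only + and ? get a backslash
```

The escaped result can be passed straight to re.compile() when the text has to be matched literally.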

Inside character classes, these flavors treat a different set of 4 punctuation characters as metacharacters. Only those 4 need to be escaped to be included literally in character classes:

]^-\

The POSIX ERE flavor, which Perl derives from, has strict rules about escaping characters with a backslash. Outside character classes, only metacharacters may be escaped. Escaping anything else is an error. Inside character classes, POSIX ERE treats the backslash as a literal, so you can’t use it to escape anything. Clever placement as in []^-] is then your only option to include the other 3 metacharacters.
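The placement trick can be tried out with a short sketch. It uses Python's re module rather than a POSIX ERE engine, which is an assumption made for illustration. Python accepts the same placement, although unlike POSIX ERE it does treat the backslash as an escape inside character classes.

```python
import re

# "]" comes first, "^" is not first, "-" comes last:
# all three are literals in this character class.
placed = re.compile(r"[]^-]")

matched = [ch for ch in "a]b^c-d\\" if placed.fullmatch(ch)]
print(matched)
```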

Perl is more flexible. It allows all punctuation characters to be escaped. Only escaping letters that don’t create something with a special meaning is an error. E.g. \b is a word boundary, while \J is an error.

Thus, in Perl, the regular expressions &[lg]t; and \&[lg]t\; are equivalent, and a lot of developers use the second variant in their code. It seems a lot of people like to escape punctuation “just in case”. Don’t needlessly escape literal characters. It’s a bad habit with several bad effects.
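The equivalence is easy to check. The sketch below uses Python's re module instead of Perl, an assumption made for illustration. For these two regexes, and for a meaningless escaped letter such as \J, it follows the rules described above.

```python
import re

plain = re.compile(r"&[lg]t;")
escaped = re.compile(r"\&[lg]t\;")   # same regex with needless backslashes

sample = "a &lt; b &gt; c &amp; d"
plain_hits = plain.findall(sample)
escaped_hits = escaped.findall(sample)
print(plain_hits, escaped_hits)

# An escaped letter without a special meaning is rejected.
try:
    re.compile(r"\J")
    letter_escape_ok = True
except re.error:
    letter_escape_ok = False
print(letter_escape_ok)
```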

It makes you look like a newbie when you don’t know which characters really need to be escaped, and which don’t.

A few extra backslashes can quickly grow into a forest of backslashes. Those two regular expressions, when included as strings in Java source code, become "&[lg]t;" and "\\&[lg]t\\;". Imagine a long regex with lots of backslashes, some required, some superfluous, but all doubled up. The regular expression syntax is difficult to read as it is. Don’t make it even more complicated.

But most importantly: you, or other people copying your regex, may run into flavor-specific issues when escaping certain literals. A regex that works fine in Perl may fail in .NET or on the command line with egrep (which uses POSIX ERE).

As I’ve mentioned, Perl doesn’t allow literal letters to be escaped. The .NET developers got this wrong when imitating the Perl syntax, and don’t allow any word characters to be escaped. Word characters include the underscore. \_ matches a literal underscore in Perl, but causes an error in .NET.

In many regex flavors, including the GNU implementation of POSIX ERE used in GNU egrep and many other open source projects, \< and \> aren’t needlessly escaped angle brackets. They’re word boundaries, matching the start of a word or the end of a word. GNU can extend POSIX this way, because \< and \> were illegal in the POSIX standard, and thus could never occur in a regex. Perl’s flexibility means that these word boundaries can never be supported by Perl using the same syntax, because it would break too many regular expressions created by developers who don’t know that angle brackets aren’t metacharacters. This creates a practical problem for developers who use both Perl and GNU utilities: Perl will happily take the GNU ERE regex \<word\>, but it won’t work as intended. GNU egrep will match word as a whole word only, while Perl looks for <word>.
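The same trap shows up in other Perl-style flavors. The sketch below uses Python's re module as a stand-in (an assumption; it is neither Perl nor GNU egrep). Python also reads \< and \> as needlessly escaped angle brackets.

```python
import re

gnu_style = re.compile(r"\<word\>")    # read here as the literal text <word>
boundaries = re.compile(r"\bword\b")   # word boundaries in Perl-style syntax

text = "a word, a <word>, and a password"
literal_hits = [m.group(0) for m in gnu_style.finditer(text)]
boundary_starts = [m.start() for m in boundaries.finditer(text)]
print(literal_hits)
print(boundary_starts)
```

The first regex finds only the text with the angle brackets around it. The second finds both standalone occurrences of word and skips the one inside password.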

Because of this, RegexBuddy flags \< and \> as an error on the Create tab when you’re using a flavor that doesn’t support these as word boundaries. It does this to make sure you’re not accidentally trying to use these as word boundaries with flavors that don’t support them. To make the error go away, either remove the needless backslash if you meant to match a literal angle bracket, or double-click the error on the Create tab to replace the escaped angle bracket with a lookaround combination that emulates the word boundary.
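One possible lookaround combination is sketched below in Python. It is an assumption about what such an emulation can look like, not necessarily the exact replacement RegexBuddy inserts. The names START_OF_WORD and END_OF_WORD are made up for the example.

```python
import re

START_OF_WORD = r"(?<!\w)(?=\w)"   # emulates GNU \<
END_OF_WORD = r"(?<=\w)(?!\w)"     # emulates GNU \>

whole_word = re.compile(START_OF_WORD + "cat" + END_OF_WORD)
starts = [m.start() for m in whole_word.finditer("cat concat cat.")]
print(starts)
```

The cat inside concat is skipped because a word character precedes it.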

On the Test tab, RegexBuddy does not complain about escaped angle brackets if the selected regex flavor doesn’t complain about them either. So you can ignore RegexBuddy’s warning and see the same results with escaped angle brackets in RegexBuddy as with the actual regex engine that you’ve asked RegexBuddy to emulate.

Thursday, 18 December 2008

TPerlRegEx Now with Proper UTF-8 (Unicode) Support

Filed under: Regex Libraries — Jan Goyvaerts @ 16:36

Five months ago I released TPerlRegEx for Delphi 2009 which enables PCRE’s UTF-8 support when compiled with Delphi 2009, so you can use TPerlRegEx with Unicode strings. TPerlRegEx still supports Delphi 2007 and earlier using Ansi strings.

Unfortunately, until today’s new release, TPerlRegEx for Delphi 2009 had a rather embarrassing bug: it didn’t actually enable the UTF-8 support in PCRE unless you set the Options property to something other than the default. With the default options, SetOptions never got called, and that was the only spot where TPerlRegEx set the PCRE_UTF8 flag.

The new release of TPerlRegEx fixes this by adding these five lines to the constructor:

{$IFDEF UNICODE}
  pcreOptions := PCRE_UTF8 or PCRE_NEWLINE_ANY;
{$ELSE}
  pcreOptions := PCRE_NEWLINE_ANY;
{$ENDIF}

This makes sure TPerlRegEx tells PCRE to use UTF-8 when TPerlRegEx uses UTF8String.

When using the buggy version of TPerlRegEx with Delphi 2009, this subject:

PerlRegEx1.Subject := '€';

will match ^.{3}$ but not ^.$.

When using the corrected version of TPerlRegEx, ^.{3}$ fails while ^.$ matches.

The reason is that in UTF-8, the euro symbol is encoded as three bytes. When PCRE operates in UTF-8 mode, the dot matches one Unicode code point, regardless of how many bytes that code point is encoded with in UTF-8. (All code points take between 1 and 4 bytes.) But when the PCRE_UTF8 flag is not set, PCRE operates in 8-bit mode, and the dot always matches one byte.

In Delphi 2007 and prior, ^.$ also matches the euro symbol, because it takes up one byte in the Windows (Ansi) code pages.
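The byte counts can be reproduced outside Delphi. The sketch below is an analogy in Python, not PCRE itself. A bytes pattern makes the dot match one byte, much like PCRE without PCRE_UTF8. A str pattern makes the dot match one code point, much like PCRE with PCRE_UTF8. Windows-1252 is assumed as the Ansi code page.

```python
import re

euro = "\u20ac"                       # the euro symbol
utf8_bytes = euro.encode("utf-8")     # three bytes in UTF-8
ansi_bytes = euro.encode("cp1252")    # one byte in Windows-1252

# Byte-oriented matching: the dot consumes one byte at a time.
bytes_three = bool(re.search(rb"^.{3}$", utf8_bytes))
bytes_one = bool(re.search(rb"^.$", utf8_bytes))

# Code-point-oriented matching: the dot consumes the whole euro symbol.
text_three = bool(re.search(r"^.{3}$", euro))
text_one = bool(re.search(r"^.$", euro))

# Single-byte code page: one byte, so ^.$ matches.
ansi_one = bool(re.search(rb"^.$", ansi_bytes))

print(len(utf8_bytes), bytes_three, bytes_one, text_three, text_one, ansi_one)
```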

If you download TPerlRegEx again, it’ll properly handle Unicode in Delphi 2009.

Wednesday, 3 December 2008

More Time for Blogging

Filed under: About Regex Guru — Jan Goyvaerts @ 15:44

For the past six months or so, I’ve spent most of my writing time on a book about regular expressions instead of on blogging. I’m done writing my part of the book, while Steven wraps up the last chapter. There’s still a lot of work to be done before it’ll appear in print. Most of that work is for O’Reilly, not for Steven and myself. Thus, I should have more time for blogging now. I’ll let you know when the book is available for pre-order.

Sunday, 2 November 2008

Detecting URLs in a Block of Text

Filed under: Regex Examples — Jan Goyvaerts @ 7:57

In his blog post The Problem with URLs, Jeff Atwood points out some of the issues with trying to detect URLs in a larger body of text using a regular expression.

The short answer is that it can’t be done. Pretty much any character is valid in URLs. The very simplistic \bhttp://\S+ fails to differentiate between punctuation that’s part of the URL and punctuation used to quote the URL. It also fails to match URLs with spaces in them. Yes, spaces are valid in URLs, and I’ve encountered quite a few web sites that use them over the years. Finally, it forgets other protocols, such as https.

In RegexBuddy’s library, you’ll find this regex if you look up “URL: Find in full text”:

\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|] (case insensitive)

Like every other regex for extracting URLs, it’s not perfect. The key benefit of this regex is that it uses a separate character class for the last character in the URL, which allows fewer punctuation characters than the character class for the other characters in the URL. It excludes punctuation that is unlikely to occur at the end of the URL, and more likely to be punctuation that’s part of the sentence the URL is quoted in. It does not allow parentheses at all.
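To see the effect of the separate final character class, here is a minimal sketch that runs the regex above through Python's re module (an assumption; any Perl-style flavor should do). The sample text and the example.com URL are made up.

```python
import re

URL_IN_TEXT = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]",
    re.IGNORECASE,
)

text = "See http://www.regexguru.com/page.html, or (https://example.com/a?b=c)."
# group(0) is the whole match; the capturing group only holds the protocol.
urls = [m.group(0) for m in URL_IN_TEXT.finditer(text)]
print(urls)
```

The comma after the first URL and the closing parenthesis and dot after the second are left out of the matches.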

In EditPad Pro’s syntax coloring schemes, which are fully editable and entirely based on regular expressions, you’ll often find this regex:

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
(case insensitive)

The main difference with the previous regex is that this one matches URLs such as www.regexguru.com without the http:// protocol. People often type URLs that way in their documents and messages, because most browsers accept them that way too.
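A quick check of that difference, again as a sketch in Python with made-up sample text:

```python
import re

URL_OR_WWW = re.compile(
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

text = "Visit www.regexguru.com. Files are at ftp.example.com/pub, too."
found = [m.group(0) for m in URL_OR_WWW.finditer(text)]
print(found)
```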

EditPad’s built-in “clickable URLs” syntax highlighting uses this regex:

\b(?:(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]
   | ((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,4})\b)
|"(?:(?:https?|ftp|file)://|www\.|ftp\.)[^"\r\n]+"?
|'(?:(?:https?|ftp|file)://|www\.|ftp\.)[^'\r\n]+'?
(free-spacing, case insensitive)

This long regex adds three alternatives to the previous regex. It adds the ability to match email addresses, with or without mailto:, and it matches URLs between single or double quotes. When the URL is quoted, it allows all characters in the URL, except line breaks and the delimiting quote. This way, any URL with weird punctuation can be highlighted correctly by placing it between a pair of quote characters. Because this regex is used to highlight text as you type, the closing quotes are optional. The highlighting runs to the end of the line until you type the closing quote. Remove the question marks after the quote characters if you will use this regex to extract URLs.
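For extraction, the regex could be used as sketched below in Python (an assumption; EditPad uses its own regex engine). The question marks after the closing quotes are removed as suggested, and re.VERBOSE stands in for free-spacing mode. The sample text and address are made up.

```python
import re

CLICKABLE = re.compile(
    r"""
    \b(?:(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]
       | ((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,4})\b)
    |"(?:(?:https?|ftp|file)://|www\.|ftp\.)[^"\r\n]+"
    |'(?:(?:https?|ftp|file)://|www\.|ftp\.)[^'\r\n]+'
    """,
    re.VERBOSE | re.IGNORECASE,
)

text = 'link "http://example.com/a b(c)" and mail joe@example.com'
hits = [m.group(0) for m in CLICKABLE.finditer(text)]
print(hits)
```

The quoted URL is matched together with its quotes, including the space and the parentheses inside it.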

So how about Jeff’s problem?

“I couldn’t come up with a way for the regex alone to distinguish between URLs that legitimately end in parens (ala Wikipedia), and URLs that the user has enclosed in parens.”

That’s not too hard, if we add the restriction that we only allow unnested pairs of parentheses in URLs. Using the second regex in this article as the starting point, add an alternative for a pair of parentheses to both character classes in that regex:

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])
(free-spacing, case insensitive)

This regex allows the same set of characters in the middle of the URL, mixed with zero or more sequences of those characters between parentheses. It allows the URL to end with the same reduced set of characters, or a final run between parentheses. Because we require the opening parenthesis to be in the URL, we don’t have to do anything complicated to check if any closing parentheses we encounter are part of the URL or not.
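Here is a sketch of the regex at work in Python, with re.VERBOSE standing in for free-spacing mode (an assumption about the flavor). The helper first_url and both sample strings are made up for illustration.

```python
import re

URL_WITH_PARENS = re.compile(
    r"""
    \b(?:(?:https?|ftp|file)://|www\.|ftp\.)
      (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
      (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])
    """,
    re.VERBOSE | re.IGNORECASE,
)

def first_url(text):
    """Return the first URL found in text, or None (hypothetical helper)."""
    m = URL_WITH_PARENS.search(text)
    return m.group(0) if m else None

# A URL that ends in a parenthesized part, itself enclosed in parentheses.
wiki = first_url("(see http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software))")
# A plain URL that the writer enclosed in parentheses.
plain = first_url("(http://www.regexguru.com/)")
print(wiki)
print(plain)
```

In the first case the closing parenthesis that pairs with an opening one inside the URL is kept. In both cases the parenthesis that belongs to the sentence is dropped.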

Note that in order to allow any number of pairs of parentheses in the middle of the URL, I moved the star from the character class to the group it is now in. I did not add another star to the group. A double-star combination like (a|b*)* is a sure-fire recipe for catastrophic backtracking.
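The cost of a quantifier inside a quantified group can be measured with a small timing sketch. It uses Python's re module and a simplified pair of regexes, with a plus instead of a star inside the group. Both choices are assumptions made for illustration. Exact timings depend on the machine, so none are shown.

```python
import re
import time

subject = "a" * 20   # contains no "!", so both regexes must fail

flat = re.compile(r"(?:[a-z])*!")       # star on the group only
nested = re.compile(r"(?:[a-z]+)*!")    # quantifier inside a starred group

def timed_match(pattern, text):
    """Return the match result and the elapsed time in seconds."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

flat_result, flat_seconds = timed_match(flat, subject)
nested_result, nested_seconds = timed_match(nested, subject)
print(flat_result, nested_result)
print(f"flat: {flat_seconds:.6f}s, nested: {nested_seconds:.6f}s")
```

With the nested quantifier, a backtracking engine has to try every way of splitting the run of letters between the inner and outer quantifier before it can give up, so each extra character roughly doubles the work.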

All the regexes in this article will be included in RegexBuddy’s library with the next free minor update. Current version is 3.2.0.
