Regex Guru

Friday, 4 April 2008

Escape Characters Only When Necessary

Filed under: Regex Philosophy — Jan Goyvaerts @ 12:35

A lot of people seem to have a habit of escaping all non-alphanumeric characters that they want to treat as literals in their regular expressions. E.g. to match #1+1=2 they’ll write \#1\+1\=2 instead of #1\+1=2. Though these regexes are equivalent in all modern regex flavors, the extraneous backslashes don’t exactly make the pattern more readable. And when formatted as a C++ string, "\\#1\\+1\\=2" is definitely a step back from "#1\\+1=2".
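The equivalence is easy to check in any one flavor. The sketch below uses Python's re module, which is an assumption made for illustration only, since the post itself is flavor-agnostic; the variable names are hypothetical.

```python
import re

text = "#1+1=2"

over_escaped = r"\#1\+1\=2"  # every non-alphanumeric character escaped
minimal = r"#1\+1=2"         # only the + metacharacter escaped

# Both patterns should match the same literal text in full.
both_match = all(re.fullmatch(p, text) is not None
                 for p in (over_escaped, minimal))
print(both_match)
```

The raw-string prefix r keeps the backslashes single here, which sidesteps the doubling seen in the C++ string form.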

Beyond readability, needlessly escaping characters can also lead to subtle problems. In most flavors, < and \< both match a literal <. But in some flavors, like the GNU flavors, < is a literal and \< is a word boundary.

Similarly, _ and \_ usually simply match _. But the .NET framework treats \_ as an error, just like most modern flavors treat escaped letters that don’t form a regex token, like \j, as an error. This is done to reserve these letters for future expansion. I recommend that you treat non-alphanumerics the same, and escape only metacharacters.
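As one data point, here is how a single flavor behaves. This is a minimal sketch assuming Python's re module (version 3.6 or later), which the post does not discuss; other flavors may differ exactly as described above.

```python
import re

# In Python's re, an escaped < or _ is simply the literal character.
lt_is_literal = re.fullmatch(r"\<", "<") is not None
underscore_is_literal = re.fullmatch(r"\_", "_") is not None

# An escaped ASCII letter that forms no regex token is rejected
# when the pattern is compiled.
try:
    re.compile(r"\j")
    unknown_escape_rejected = False
except re.error:
    unknown_escape_rejected = True

print(lt_is_literal, underscore_is_literal, unknown_escape_rejected)
```

A pattern that leans on \< or \_ being literals works here, but it is exactly the kind of pattern that can change meaning or fail when moved to another flavor.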

Modern regex flavors have 11 metacharacters outside character classes: the opening square bracket [, the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening round bracket ( and the closing round bracket ).
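A minimal sketch of an escaper built on that list, written in Python as an illustration. The helper name escape_literal is hypothetical, and adding the opening curly brace to the set is a conservative assumption of this sketch (a brace can start a quantifier), not part of the list of 11.

```python
import re

# The 11 metacharacters listed above, plus { as a conservative extra
# (an assumption of this sketch, since a brace can start a quantifier).
METACHARACTERS = set("[\\^$.|?*+()") | {"{"}

def escape_literal(text: str) -> str:
    """Hypothetical helper: backslash-escape only the metacharacters."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)

pattern = escape_literal("#1+1=2")
print(pattern)  # #1\+1=2

# The escaped pattern matches the original text literally.
print(re.fullmatch(pattern, "#1+1=2") is not None)  # True
```

Characters such as #, =, < and _ pass through untouched, so the resulting pattern stays close to the text it matches.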

The closing square bracket and the curly braces are indeed not in this list. The closing square bracket is an ordinary character outside character classes. Sometimes I do escape it for readability, e.g. when using a regex like \[[0-9a-f]\] to match [a]. The opening curly brace only needs to be escaped if it would otherwise form a quantifier like {3}. An exception to this rule is Java, which always requires { to be escaped.
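The bracket and brace behavior can be checked in one flavor as follows. This sketch assumes Python's re module, chosen only for illustration; it says nothing about Java, where the rule differs as noted above.

```python
import re

# A closing square bracket outside a character class is an ordinary
# character, so it matches with or without a backslash.
unescaped_bracket = re.fullmatch(r"\[[0-9a-f]]", "[a]") is not None
escaped_bracket = re.fullmatch(r"\[[0-9a-f]\]", "[a]") is not None

# {3} forms a quantifier: a{3} matches "aaa", not the text "a{3}".
is_quantifier = re.fullmatch(r"a{3}", "aaa") is not None
not_literal = re.fullmatch(r"a{3}", "a{3}") is None

# Escaping the opening brace makes the whole sequence literal.
literal_brace = re.fullmatch(r"a\{3}", "a{3}") is not None

print(unescaped_bracket, escaped_bracket,
      is_quantifier, not_literal, literal_brace)
```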

Inside character classes, different metacharacters apply. Namely, the caret ^, the hyphen -, the closing bracket ] and the backslash \ itself are metacharacters. You can actually avoid escaping these, except for the backslash, by positioning them so that their special meaning cannot apply. You can place ] right after the opening bracket, - right before the closing bracket and ^ anywhere except right after the opening bracket. So []^\\-] matches any of the 4 metacharacters inside character classes. Again, one flavor has to deviate from normal practice. The JavaScript standard treats [] as an empty character class. This is not very useful, as it can never match anything. No surprise that the Internet Explorer developers got this wrong, and follow the usual practice of treating ] after [ as a literal. I recommend that you escape the 4 metacharacters inside character classes for maximum compatibility with various flavors, and to make your regex easier to understand by other developers who may be confused by something like []^\\-]. But don’t needlessly add backslashes to a regex like [*/._] which is perfectly fine without.
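To compare the positioned and the escaped forms side by side, here is a small sketch in Python's re module, which is assumed for illustration and follows the usual practice of treating ] right after [ as a literal. JavaScript would read the first pattern differently, as described above.

```python
import re

specials = "]^\\-"  # the characters ] ^ \ and -

positioned = re.compile(r"[]^\\-]")   # relies on placement inside the class
escaped = re.compile(r"[\]\^\\\-]")   # each class metacharacter escaped

# Both classes accept each of the special characters.
same = all(positioned.fullmatch(c) is not None
           and escaped.fullmatch(c) is not None
           for c in specials)

# Ordinary punctuation needs no backslashes inside a class.
plain = re.compile(r"[*/._]")
plain_ok = all(plain.fullmatch(c) is not None for c in "*/._")

print(same, plain_ok)
```

The escaped form is longer, but it reads the same way in flavors that disagree about a leading ].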

6 Comments

  1. Hello Jan,

    I was wondering why there isn’t an option in RegexBuddy that lets you paste a string from the clipboard and does the escaping automatically for you. Is there a reason why it’s not already there?

    Kind Regards, Tom

    Comment by Tom Pester — Friday, 4 April 2008 @ 15:27

  2. That option already exists in RegexBuddy. Right-click on the regular expression and select Insert Token, Literal Text in the context menu. Paste your text into the dialog box that appears. RegexBuddy will automatically escape metacharacters as required by the selected regular expression flavor.

    Comment by Jan — Friday, 4 April 2008 @ 16:23

  3. also, in PCRE you can use \Q…\E to “quote” a subexpression, treating all characters inside as literals

    Comment by toupeira — Saturday, 5 April 2008 @ 19:38

  4. Thanks for the info Jan. I expected an item in the context menu called “Paste” but it makes sense now.

    Comment by Tom Pester — Tuesday, 15 April 2008 @ 16:29

  5. It’s great to tell people to only escape characters when necessary, but without a readily available link/reference to ALL escape characters it’s a hard sell.

    Comment by Tristan — Friday, 20 January 2012 @ 3:21

  6. Special Characters in Regular Expressions

    Comment by Jan Goyvaerts — Monday, 27 August 2012 @ 19:56
