Regex Guru

Friday, 4 April 2008

Escape Characters Only When Necessary

Filed under: Regex Philosophy — Jan Goyvaerts @ 12:35

A lot of people seem to have a habit of escaping all non-alphanumeric characters that they want to treat as literals in their regular expressions. E.g. to match #1+1=2 they’ll write \#1\+1\=2 instead of #1\+1=2. Though these regexes are equivalent in all modern regex flavors, the extraneous backslashes don’t exactly make the pattern more readable. And when formatted as a C++ string, "\\#1\\+1\\=2" is definitely a step back from "#1\\+1=2".

Beyond readability, needlessly escaping characters can also lead to subtle problems. In most flavors, < and \< both match a literal <. But in some flavors, like the GNU flavors, < is a literal and \< is a word boundary.

Similarly, _ and \_ usually simply match _. But the .NET framework treats \_ as an error, just like most modern flavors treat escaped letters that don’t form a regex token, like \j, as an error. This is done to reserve these letters for future expansion. I recommend that you treat non-alphanumerics the same, and escape only metacharacters.
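You can see this strictness without leaving JavaScript: in Unicode mode (the u flag), JavaScript behaves much like .NET here, allowing identity escapes only for actual metacharacters. This is a sketch, not a full flavor comparison:

```javascript
// Returns whether a pattern compiles in JavaScript's strict Unicode mode,
// where escaping a non-metacharacter is a syntax error.
function compiles(pattern) {
  try {
    new RegExp(pattern, 'u');
    return true;
  } catch (e) {
    return false;
  }
}

console.log(compiles('#1\\+1=2')); // true: only the + needed escaping
console.log(compiles('\\_'));      // false: \_ is reserved, an error
console.log(compiles('\\j'));      // false: escaped letter, an error
```

The same three patterns compile fine in most older, more lenient flavors, which is exactly why relying on needless escapes is a portability trap.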

Modern regex flavors have 11 metacharacters outside character classes: the opening square bracket [, the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening round bracket ( and the closing round bracket ).
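If you need to quote arbitrary text for use in a regex, a helper that escapes only these 11 metacharacters follows directly from the list above. This is a minimal sketch in JavaScript; the function name is my own:

```javascript
// Escape only the 11 metacharacters that are special outside
// character classes: [ \ ^ $ . | ? * + ( )
function escapeRegexLiteral(s) {
  return s.replace(/[[\\^$.|?*+()]/g, '\\$&');
}

console.log(escapeRegexLiteral('#1+1=2')); // #1\+1=2
```

Note that it produces #1\+1=2, not \#1\+1\=2: everything that isn't a metacharacter is left alone.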

The closing square bracket and the curly braces are indeed not in this list. The closing square bracket is an ordinary character outside character classes. Sometimes I do escape it for readability, e.g. when using a regex like \[[0-9a-f]\] to match [a]. The opening curly brace only needs to be escaped if it would otherwise form a quantifier like {3}. An exception to this rule is Java, which always requires { to be escaped.
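Both rules are easy to check in the JavaScript flavor (a quick sketch; Java, as noted, is stricter about the curly brace):

```javascript
console.log(/\[[0-9a-f]\]/.test('[a]')); // true: escaped [ and ] match literally
console.log(/^ab{2}$/.test('abb'));      // true: {2} is a quantifier
console.log(/^a{x}$/.test('a{x}'));      // true: {x} is not a valid quantifier,
                                         // so { and } match as literals
```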

Inside character classes, different metacharacters apply. Namely, the caret ^, the hyphen -, the closing bracket ] and the backslash itself are metacharacters. You can actually avoid escaping these, except for the backslash, by positioning them so that their special meaning cannot apply. You can place ] right after the opening bracket, - right before the closing bracket and ^ anywhere except right after the opening bracket. So []^\\-] matches any of the 4 metacharacters inside character classes. Again, one flavor has to deviate from normal practice. The JavaScript standard treats [] as an empty character class. This is not very useful, as it can never match anything. No surprise that the Internet Explorer developers got this wrong, and follow the usual practice of treating ] after [ as a literal. I recommend that you escape the 4 metacharacters inside character classes for maximum compatibility with various flavors, and to make your regex easier to understand by other developers who may be confused by something like []^\\-]. But don’t needlessly add backslashes to a regex like [*/._] which is perfectly fine without them.
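In standard JavaScript the [] trick is unavailable, so the positional version needs ] escaped, but - and ^ can still be defused by position alone. A small sketch comparing the two styles:

```javascript
// Positional: ] must be escaped in JavaScript, ^ is not first, - is last.
const byPosition = /^[\]^\\-]$/;
// Fully escaped: all four character-class metacharacters backslashed.
const byEscaping = /^[\^\-\]\\]$/;

for (const ch of [']', '^', '-', '\\']) {
  console.log(ch, byPosition.test(ch), byEscaping.test(ch)); // both true
}
```

Both classes match exactly the same four characters; the escaped version is simply harder to get wrong across flavors.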

Saturday, 15 March 2008

Convenience and Compatibility

Filed under: Regex Philosophy — Jan Goyvaerts @ 15:39

Fixing up user input with a regular expression makes your application or web site more convenient for the user. In cases where there is little forward or backward compatibility to worry about, you can apply broad strokes to make things as easy as possible on the user. E.g. on an order form, stripping out all non-digits from the credit card number is no problem. The number is entered once, the card is charged once, and then the data is discarded. If the customer wants to make another purchase in the future, the card number has to be typed in again. If at that future moment, credit card numbers include punctuation, you’ll already have adapted your order form to deal with that, and the customer will already be aware of the new credit card rules.

Things are different when data persists. At the outset, you have forward compatibility to worry about. Future versions of your software will have to deal with the old data. Later, you have backward compatibility to worry about. If old versions of the software made assumptions about invalid data they shouldn’t have, you’ll be cursed to deal with that invalid data forever, or make the new software incompatible with the old data.

But even here, compatibility doesn’t always trump convenience. Take HTML and XHTML. HTML went for convenience, by design and more so by implementation. It’s case insensitive, quoting attribute values is optional, etc. Most browsers are even more forgiving than the HTML standard itself. The result is that we have an Internet flooded with technically invalid HTML. Yet somehow, 99% of all Internet users are oblivious to all this.

XHTML went the other way. By design, it’s very strict. Case sensitive, strict syntax, etc. Most browsers, however, still happily render invalid XHTML, at least by default.

A lot of programmers obsess about making sure their web pages are valid HTML or XHTML. That’s great. But many also lament that there is so much invalid HTML floating around.

I disagree. Lots of invalid HTML is great, because the alternative is very likely to be no HTML at all. HTML became such a popular format precisely because you could make the sloppiest “Hello world” ever, and still see “Hello world” in your browser instead of “Syntax error”. A badly rendered web page isn’t the end of the world. In this situation, wide adoption of the web by the masses is far more important than the ability for technically skilled webmasters to obtain pixel-perfect spacing. And I say that even though I personally do like pixel-perfect spacing.

Web browsers trying to make sense out of invalid HTML isn’t a problem at all. The invalid HTML that permeates the web only inconveniences the people developing web browsers and other software that renders HTML. Hard work by a small number of browser developers makes the web convenient for everybody.

The problem with browsers is not that they try to render invalid HTML, but that they render valid HTML inconsistently. And that’s a compatibility issue that does affect everybody who tries to create web pages. If browser developers want to make the Internet perfectly convenient, they’ll have to treat each major past browser release as a de facto standard that future browsers have to support. This is what the Internet Explorer team is doing now.

So what’s the regex guru’s advice? If you’re writing code that takes input from other code, use your regular expressions to make sure the provided input is exactly what it needs to be. If not, return an error. This way, all invalid data is automatically reserved for future use. Anything you accept as valid or invalid-but-i-know-what-you-mean, you’ll have to accept in the same way forever. Best to leave the i-know-what-you-mean conversion to helper functions.
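That split between a strict boundary and a lenient helper can be sketched in a few lines of JavaScript. The 16-digit rule and function names here are assumptions for illustration, not real card-number validation:

```javascript
// Strict boundary: accept exactly what's valid, reject everything else.
// (Assumed rule for illustration: exactly 16 digits.)
function parseCardNumber(input) {
  if (!/^[0-9]{16}$/.test(input)) {
    throw new Error('invalid card number');
  }
  return input;
}

// Lenient "i-know-what-you-mean" helper, kept separate from validation:
// strips spaces, dashes, and any other non-digits from human input.
function cleanCardNumber(userInput) {
  return userInput.replace(/\D+/g, '');
}

console.log(parseCardNumber(cleanCardNumber('1234 5678 9012 3456')));
// 1234567890123456
```

Code calling code goes through the strict parser; only the human-facing form runs the clean-up first.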

But if you’re taking input from humans, particularly those who are paying you for the privilege of giving your web site or software their input, make their lives easy. It’s hard work sometimes, but you’ll have more customers!

Tuesday, 11 March 2008

No Spaces or Dashes

Filed under: Regex Philosophy — Jan Goyvaerts @ 10:01

Programmers are lazy. That’s why we slave away hours on end to program our computers to automate tedious tasks for ourselves and others. That’s the right kind of laziness.

But far too often, programmers show the wrong kind of laziness. Instead of spending a bit more effort on their product, they push the responsibility to the end user.

How many times have you seen an order form that says “no spaces or dashes” next to the field for the credit card number? It always makes me wonder how seriously such companies take themselves. I mean, if credit card companies design their cards to use spaces to group the embossed digits, there’s probably a reason for it. Like making it easy for humans to keep track of how many digits they’ve already entered.

Compare these four lines of code, in Perl, PHP, JavaScript and HTML, respectively:

$cc =~ s/\D+//g;
$cc = preg_replace('/\D+/', '', $cc);
cc = cc.replace(/\D+/g, '');
<p>No spaces or dashes</p>

The first three lines let the computer strip out all non-digits. The HTML version tells the user to do it. Yet, the dumb HTML version takes about as many webmaster keystrokes as the smart versions.

Regular expressions make it very easy to fix up user input. If an underlying library or remote service has strict input requirements, don’t force those requirements onto the user. Much credit card processing software has been around since the days when CPU time was at a premium, so it likely doesn’t do fancy stuff like stripping out spaces and dashes. Use that little regex to bend the card number to the processor’s requirements, instead of making the user bend to it.

I’m sure if you spend half an hour going over the forms on your web site or the dialog boxes in your software, you can find many places where you could replace error messages with simple regexes that fix up the input.