Regex Guru

Saturday, 15 March 2008

Convenience and Compatibility

Filed under: Regex Philosophy — Jan Goyvaerts @ 15:39

Fixing up user input with a regular expression makes your application or web site more convenient for the user. In cases where there is little forward or backward compatibility to worry about, you can apply broad strokes to make things as easy as possible on the user. E.g. on an order form, stripping out all non-digits from the credit card number is no problem. The number is entered once, the card is charged once, and then the data is discarded. If the customer wants to make another purchase in the future, the card number has to be typed in again. If at that future moment, credit card numbers include punctuation, you’ll already have adapted your order form to deal with that, and the customer will already be aware of the new credit card rules.
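As a minimal sketch of that kind of broad stroke, here is one way to strip the non-digits in Python. The function name `normalize_card_number` is made up for illustration, and the sketch assumes card numbers consist of ASCII digits only. It tidies the input but does not check that the card number is valid.

```python
import re

def normalize_card_number(raw: str) -> str:
    """Remove every character that is not an ASCII digit 0-9.

    Hypothetical helper: it only cleans up what the user typed.
    It performs no length check and no checksum validation.
    """
    return re.sub(r"[^0-9]", "", raw)

# Spaces, dashes and dots typed by the user all disappear.
cleaned = normalize_card_number("4111-1111 1111.1111")
```

Whatever validation your payment processor requires would then run on the cleaned string.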

Things are different when data persists. At the outset, you have forward compatibility to worry about. Future versions of your software will have to deal with the old data. Later, you have backward compatibility to worry about. If old versions of the software made assumptions about invalid data they shouldn’t have, you’ll be cursed to deal with that invalid data forever, or make the new software incompatible with the old data.

But even here, compatibility doesn’t always trump convenience. Take HTML and XHTML. HTML went for convenience, by design and more so by implementation. It’s case insensitive, quoting attribute values is optional, etc. Most browsers are even more forgiving than the HTML standard itself. The result is that we have an Internet flooded with technically invalid HTML. Yet somehow, 99% of all Internet users are oblivious to all this.

XHTML went the other way. By design, it’s very strict. Case sensitive, strict syntax, etc. Most browsers, however, still happily render invalid XHTML, at least by default.

A lot of programmers obsess about making sure their web pages are valid HTML or XHTML. That’s great. But many also lament that there is so much invalid HTML floating around.

I disagree. Lots of invalid HTML is great, because the alternative is very likely to be no HTML at all. HTML became such a popular format precisely because you could make the sloppiest “Hello world” ever, and still see “Hello world” in your browser instead of “Syntax error”. A badly rendered web page isn’t the end of the world. In this situation, wide adoption of the web by the masses is far more important than the ability for technically skilled webmasters to obtain pixel-perfect spacing. And I say that even though I personally do like pixel-perfect spacing.

Web browsers trying to make sense out of invalid HTML isn’t a problem at all. The fact that this perpetuates invalid HTML only inconveniences the people developing web browsers and other software that renders HTML. Hard work by a small number of browser developers makes the web convenient for everybody.

The problem with browsers is not that they try to render invalid HTML, but that they render valid HTML inconsistently. And that’s a compatibility issue that does affect everybody who tries to create web pages. If browser developers want to make the Internet perfectly convenient, they’ll have to treat each major past browser release as a de facto standard that future browsers have to support. This is what the Internet Explorer team is doing now.

So what’s the regex guru’s advice? If you’re writing code that takes input from other code, use your regular expressions to make sure the provided input is exactly what it needs to be. If not, return an error. This way, all invalid data is automatically reserved for future use. Anything you accept as valid or invalid-but-i-know-what-you-mean, you’ll have to accept in the same way forever. Best to leave the i-know-what-you-mean conversion to helper functions.
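To make the split concrete, here is a rough sketch in Python. Everything in it is an assumption chosen for illustration: the date field, the `YYYY-MM-DD` format, and the names `parse_strict` and `tidy_date`. The strict function is what other code calls. The lenient helper is kept separate, so the set of inputs the strict function accepts never has to grow.

```python
import re
from typing import Tuple

# Strict format for code-to-code input: exactly YYYY-MM-DD, nothing else.
STRICT_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

# Lenient format for the helper: optional surrounding whitespace,
# '-', '/' or '.' as separator, one or two digits for month and day.
LENIENT_DATE = re.compile(
    r"\s*([0-9]{4})[-/.]([0-9]{1,2})[-/.]([0-9]{1,2})\s*"
)

def parse_strict(value: str) -> Tuple[int, int, int]:
    """Accept only the exact format; everything else is an error.

    Only the syntax is checked here, not whether the month and day
    are in range. Rejected inputs stay reserved for future use.
    """
    if STRICT_DATE.fullmatch(value) is None:
        raise ValueError(f"invalid date: {value!r}")
    year, month, day = value.split("-")
    return int(year), int(month), int(day)

def tidy_date(value: str) -> str:
    """Hypothetical i-know-what-you-mean helper for human input.

    Converts sloppy input into the strict format, or raises ValueError
    when it cannot make sense of the input.
    """
    match = LENIENT_DATE.fullmatch(value)
    if match is None:
        raise ValueError(f"cannot interpret date: {value!r}")
    year, month, day = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"

# Human input passes through the helper before it reaches the strict parser.
parsed = parse_strict(tidy_date(" 2008/3/15 "))
```

If the helper’s guesses turn out to be wrong, you can change or drop it without touching the format that other code depends on.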

But if you’re taking input from humans, particularly those who are paying you for the privilege of giving your web site or software their input, make their lives easy. It’s hard work sometimes, but you’ll have more customers!
