Regex Guru

Thursday, 8 May 2008

Follow Up with Adequate Testing

Filed under: Regex Trouble — Jan Goyvaerts @ 15:05

I always emphasize the importance of testing your regular expressions on all possible input data, particularly on any input you might receive that you don’t want the regex to match.

Writing a regular expression that matches something is often quite straightforward. Making it match all the variations of what you want can be tricky. Excluding all the things you don’t want is often the hard part.

The regex in the Do Follow plugin does not match all HTML anchor tags with a rel="nofollow" attribute. When WordPress encounters something like <a href="url"> in a comment, it changes it into <a href="url" rel="nofollow">. The plugin regex matches this just fine. So this isn’t really a bug in the plugin.

The problem arises if you were to blindly copy-and-paste this regular expression into your own code. A lot of programmers do that, and it’s a mistake. Always thoroughly test any regex you plan to use on both valid and invalid data. Try either the original regex or my earlier improved regex on this: <a rel="nofollow" href="url">. Neither one matches it. But this anchor tag is valid. The order of attributes is irrelevant in HTML and XHTML.

The problem is that the author of the original regex used \s+ to force whitespace to occur after the a and before the rel parts of the match. This means the regex requires at least two whitespace characters between the a and rel, possibly with other whitespace and characters between those two. But one space is actually sufficient, and one space is all that <a rel="nofollow" href="url"> has.
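A minimal Python sketch makes the effect easy to see. The pattern below is a simplified stand-in that keeps only the two \s+ tokens around a lazy [^<>]*?. It is an assumption made for illustration, not the plugin's actual regex, and the variable names are hypothetical.

```python
import re

# Simplified stand-in (an assumption, not the plugin's real regex):
# whitespace is demanded both right after "a" and right before "rel".
two_gaps = re.compile(r'<a\s+[^<>]*?\s+rel=', re.IGNORECASE)

rel_last = '<a href="url" rel="nofollow">'
rel_first = '<a rel="nofollow" href="url">'

# Two separate runs of whitespace exist here, so both \s+ tokens are satisfied.
print(bool(two_gaps.search(rel_last)))

# Only one space separates "a" from "rel". The first \s+ consumes it,
# and the second \s+ finds no whitespace directly in front of "rel".
print(bool(two_gaps.search(rel_first)))
```

The first call reports a match and the second does not, even though both tags are equally valid.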

An easy solution is to specify in the regex what is really meant. We don’t care if there are any spaces between the a and rel. What we require is that they are complete words. This is better done with word boundaries, like this:

'/
  (
    <a\b
    [^<>]*?
    \b
    rel=["\']
    [a-z0-9\s\-_|[\]]*?
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_|[\]]*
    ["\']
    [^<>]*
    >
  )
/isx'

Even with this improvement, this is still a regular expression dedicated to a particular job. It will still match <a all this rel="nofollow" nonsense!!!> which is obviously not a valid anchor tag. In the context of stripping rel="nofollow" from comments, we don’t care. The point is, you should never copy-and-paste a regex without being sure of exactly what it does. The only way to be really sure is to test.
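One way to do that testing is a small table of inputs, each paired with the result you expect. The sketch below is a Python port of the word-boundary regex, not the plugin's PHP code. The names NOFOLLOW_TAG, CASES and run_cases are hypothetical, and the test inputs beyond the ones discussed in this post are my own picks. The only change to the pattern is that [ is escaped inside the character classes, since Python's re module can warn about an unescaped [ there.

```python
import re

# Python port of the free-spacing regex (assumption: behaves like the
# PCRE original for these inputs). "[" is escaped inside the classes.
NOFOLLOW_TAG = re.compile(r"""
    (
      <a\b
      [^<>]*?
      \b
      rel=["']
      [a-z0-9\s\-_|\[\]]*?
    )
    (
      \b
      nofollow
      \b
    )
    (
      [a-z0-9\s\-_|\[\]]*
      ["']
      [^<>]*
      >
    )
""", re.IGNORECASE | re.DOTALL | re.VERBOSE)

# Each entry pairs an input with whether the regex is expected to match it.
CASES = [
    ('<a href="url" rel="nofollow">', True),            # what WordPress emits
    ('<a rel="nofollow" href="url">', True),            # attributes reversed
    ('<a href="url" rel="external nofollow">', True),   # several rel values
    ('<a href="url">', False),                          # no rel attribute
    ('<abbr rel="nofollow">', False),                   # not an anchor tag
    ('<a href="url" rel="nofollowed">', False),         # not the whole word
    ('<a all this rel="nofollow" nonsense!!!>', True),  # invalid, yet matched
]

def run_cases(cases):
    """Return the inputs whose actual result differs from the expected one."""
    failures = []
    for text, expected in cases:
        actual = NOFOLLOW_TAG.search(text) is not None
        if actual != expected:
            failures.append(text)
    return failures

print(run_cases(CASES))
```

An empty list means every input behaved as expected. The last entry records the nonsense tag as a known match, so anyone reusing the regex for a different job sees that limitation up front.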
