Regex Guru

Thursday, 8 May 2008

No Follow The Lazy Dot

Filed under: Regex Trouble — Jan Goyvaerts @ 8:31

.*? is what I call the “lazy dot”. It matches any sequence of characters. It matches as few of them as needed to make the whole regex match. The problem is that if there’s no way for the regex to match, the lazy dot will continue all the way until the end of the line or the end of the subject string (if the dot is allowed to match newlines). If you have two lazy dots in a regex, they’ll both try to expand to match the whole regex, trying every possible permutation between them. That leads to catastrophic backtracking. The HTML file example at the bottom of that page shows how a bunch of lazy dots will get into a fight.
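To see the lazy dot's behavior concretely, here is a minimal sketch. It uses Python's re module rather than PHP's preg functions, and the sample HTML string and variable names are made up for illustration.

```python
import re

# Hypothetical sample: the first anchor has no rel attribute, the second does.
html = '<a href="one.html">one</a> <a href="two.html" rel="nofollow">two</a>'

# Lazy dot: stops at the first > that lets the regex match.
lazy = re.search(r'<a\s.*?>', html)

# Greedy dot: runs to the end of the string, then backs off to the last >.
greedy = re.search(r'<a\s.*>', html)

# Lazy dot when the first tag has no "rel=": it keeps expanding,
# right past the closing > and into the second anchor.
crossing = re.search(r'<a\s.*?rel=', html)

print(lazy.group(0))
print(greedy.group(0))
print(crossing.group(0))
```

The third search is the troublesome case. Nothing tells the lazy dot that it has left the first tag, so it keeps growing until "rel=" turns up somewhere, or until it runs out of text.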

Yesterday I installed the Do Follow plugin on all my blogs. Looking at the plugin’s source, I saw that a single regular expression was used to strip the rel="nofollow" attributes that WordPress adds. Here it is, formatted as a PHP preg string as it appears in the code:

'/
  (
    <a\s+
    .*
    \s+
    rel=["\']
    [a-z0-9\s\-_\|\[\]]*
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_\|\[\]]*
    ["\']
    .*
    >
  )
/isUx'

Notice the /U modifier at the end. This is a PHP flag that reverses the meaning of the question mark after a quantifier. Normally, .* is greedy and .*? is lazy. With /U, .* is lazy and .*? is greedy. You could call /U the Uber Lazy mode because it even saves you typing the extra ?.

The problem with this regex is that when it encounters an HTML anchor that does not have rel="nofollow", the first lazy dot will expand past that anchor, all the way to the end of the HTML code it’s trying to strip the “nofollow” from. That’s very inefficient. (Of course, the whole business of stripping off something that shouldn’t be added in the first place is very inefficient. Suffice to say I’m less and less impressed with WordPress each day.)
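The runaway lazy dot can be observed with a rough Python translation of the plugin's regex. This is a sketch under assumptions: Python has no /U flag, so every quantifier is spelled out as lazy by hand, the /isx flags become re.I | re.S | re.X, and the sample HTML strings are invented.

```python
import re

# Python rendering of the plugin's pattern. /U is emulated by adding
# an explicit ? to each quantifier (an assumption of this sketch).
ORIGINAL = re.compile(r'''
  (
    <a\s+?
    .*?
    \s+?
    rel=["']
    [a-z0-9\s\-_|\[\]]*?
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_|\[\]]*?
    ["']
    .*?
    >
  )
''', re.I | re.S | re.X)

# Hypothetical sample: only the second anchor carries rel="nofollow".
html = '<a href="one.html">one</a> <a href="two.html" rel="nofollow">two</a>'

m = ORIGINAL.search(html)
print(m.start())   # where the match begins
print(m.group(0))  # the match swallows the first anchor as well

# With no "nofollow" anywhere, the lazy dot walks to the end of the
# text before the regex gives up.
no_match = ORIGINAL.search('<a href="one.html">one</a> and more text')
print(no_match)
```

The match starts at the first anchor even though only the second one has the attribute. On a short string that costs nothing, but on a full page every anchor without “nofollow” triggers a scan through everything after it.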

Here’s my version:

'/
  (
    <a\s+
    [^<>]*?
    \s+
    rel=["\']
    [a-z0-9\s\-_|[\]]*?
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_|[\]]*
    ["\']
    [^<>]*
    >
  )
/isx'

I removed the /U flag. I replaced the dot with the far more sensible negated character class [^<>]. Angle brackets can’t occur within an HTML anchor tag. Coding this small piece of information into the regex is all it takes to make it stop at the closing > of any anchor tag that doesn’t use “nofollow” already. I made the first set of quantifiers lazy, and the second set greedy, to minimize the amount of backtracking needed. But that’s a minor issue. The major saving is to make sure the regex doesn’t needlessly scan through everything that follows after an HTML anchor without “nofollow”. Are you following me? :-)
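Here is the same kind of sketch for my version, again in Python rather than PHP. The sample HTML and the replacement string r'\1\3' (keep the first and third groups, drop “nofollow”) are assumptions for illustration, not taken from the plugin's code.

```python
import re

# Python rendering of the revised pattern; /isx becomes re.I | re.S | re.X.
IMPROVED = re.compile(r'''
  (
    <a\s+
    [^<>]*?
    \s+
    rel=["']
    [a-z0-9\s\-_|\[\]]*?
  )
  (
    \b
    nofollow
    \b
  )
  (
    [a-z0-9\s\-_|\[\]]*
    ["']
    [^<>]*
    >
  )
''', re.I | re.S | re.X)

# Hypothetical sample: only the second anchor carries rel="nofollow".
html = '<a href="one.html">one</a> <a href="two.html" rel="nofollow">two</a>'

m = IMPROVED.search(html)
print(m.group(0))  # only the second anchor's opening tag

# Assumed replacement: keep groups 1 and 3, drop the "nofollow" in group 2.
stripped = IMPROVED.sub(r'\1\3', html)
print(stripped)

# A tag with a single space before rel is not matched, because the
# pattern asks for whitespace twice: after <a and again before rel.
single_space = IMPROVED.search('<a rel="nofollow">x</a>')
print(single_space)
```

Because [^<>] cannot cross the closing > of the first tag, the attempt at the first anchor fails right there and the regex moves on. The last search shows the limitation raised in the first comment below: when rel is the first attribute and is preceded by only one whitespace character, this pattern does not match.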

3 Comments

  1. Your regex only works when there’re at least two whitespace chars in front of the rel-attribute… here is a fixed version.

    Comment by Christoph Roeder — Thursday, 8 May 2008 @ 13:26

  2. Christoph,

    Your regex didn’t survive WordPress’s HTML conversion, so I removed it from your comment.

    Your point is valid, though. I’ll blog about it.

    Comment by Jan — Thursday, 8 May 2008 @ 14:35

  3. […] regex in the Do Follow plugin does not match all HTML anchor tag with a rel=”nofollow” attribute. When wordpress […]

    Pingback by Follow Up with Adequate Testing - Regex Guru — Thursday, 8 May 2008 @ 15:05
