Unintended Backtracking Can Bite You
Backtracking occurs when the regular expression engine encounters a regex token that does not match the next character in the string. The regex engine will then back up part of what it matched so far, to try different alternatives and/or repetitions. Understanding this process will make all the difference between guessing and understanding why a regular expression matches what it does and doesn’t.
But even when you know all about backtracking, it’s still easy to miss unintended consequences when you don’t test the regex on enough examples of things you don’t want to match. In the tutorial topic about backreferences, I showed the regular expression <([A-Z][A-Z0-9]*)[^>]*>.*?</\1> as a way to match a pair of HTML tags, like <tag>content</tag>. It does this just fine. Can you figure out when this regex will go wrong?
The regex won’t go wrong on properly nested HTML tags. The problem occurs when the tags don’t match. E.g. the regex will match <tagxyz>content</tag>, even though these aren’t matching HTML tags.
At first, ([A-Z][A-Z0-9]*) will store tagxyz into the first capturing group. This eventually causes \1 to fail, as tagxyz doesn’t match tag. But the tireless regex engine doesn’t stop here. Along the way, it has noticed that [A-Z0-9]* has repeated 5 times. That’s five steps the regex engine can backtrack. So it goes back to capturing tagxy into \1. The following [^>]* then matches the z.
Now you might say that deleting [^>]* from the regex will solve the problem. It will, if you don’t want to match HTML tags with attributes like <tag attr="value">content</tag>. But we want to allow those, so we need the [^>]* to skip over any attributes the tag might have, or even just some extra whitespace between the tag name and the >.
\1 will fail to match until the engine has backtracked enough so that ([A-Z][A-Z0-9]*) matches tag, and the [^>]* skips the xyz. Then \1 matches tag and a valid match is found for the whole regex.
So how to fix this? There are two solutions.
One solution is to be more precise about what we want. The name of the tag must be a whole word. The [^>]* can’t be allowed to match part of the tag. We can easily specify this by adding a word boundary to the mix: <([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>. The regex engine will still backtrack when \1 fails to match. But each time the engine makes [A-Z0-9]* give up one more character, \b will immediately fail to match.
This is the change that I made to the tutorial topic about backreferences. The word boundary is easy to understand, and supported by all modern regex flavors.
The other solution is to tell the regex engine to forget about backtracking alltogether. The two greedy quantifiers in this regular expression never have a reason to backtrack. ([A-Z][A-Z0-9]*) will match the tag’s name straight away. If it backtrack, it cuts of part of the tag name, which we never want. There’s no point in backtracking [^>]* either. It will eat up everything up to the tag’s closing >. There’s no point in backtracking it, since [^>]* and > are mutually exclusive. If [^>]* gives up a character it previously matched, > could never match it. Some regex engines, like the one in the Just Great Software products, even detect this and automatically disable backtracking for mutually exclusive tokens.
There are two ways to disable backtracking on a quantifier. The easiest one is to use possesive quantifiers. Only a few regex flavors support these. <([A-Z][A-Z0-9]*+)[^>]*+>.*?</\1> uses two possessive quantifiers to match the opening tag. Like before, ([A-Z][A-Z0-9]*+) will capture tagxyz and \1 will fail. The .*? will still expand to look for a matching closing tag. But if it can’t find one, that’s the end of the story.
A few more regular expression flavors support atomic grouping. When the regex engine exists an atomic group, it instantly forgets about all backtracking positions it remembered inside the group. <(?>([A-Z][A-Z0-9]*)[^>]*)>.*?</\1> puts the opening tag in an atomic group. This regex will work in exactly the same way, though the mechanics inside the regex engine is slightly different. (Possessive quantifiers never store backtracking positions. Using atomic grouping, the greedy quantifiers store backtracking positions, which are subsequently thrown away.)
There are still things that the improved regex can’t do, like dealing with nested identical tags and incorrectly nested tags. Obviously, you can’t write a full HTML parser with just one regex. Always test your regex on both good and bad data to see if it does what you want.
My philosophy is that the more backtracking control and anchoring you apply, the safer you are. When I see something like <([A-Z][A-Z0-9]*)[^>]*>.*?</\1> from someone like you who actually knows what they’re doing, the lack of \b immediately stands out as code golfing. I’d probably trust that you already thought through the backtracking scenarios and determined that the word boundary (or atomic group) wasn’t necessary, but I’d feel a little more uneasy. Clever brevity at the expense of explicitness is usually where I get bitten on unintended backtracking. But then I really like code golfing.
Comment by Steve — Wednesday, 9 April 2008 @ 22:56
Code golfing?
Comment by drazle — Monday, 14 April 2008 @ 21:33
Golfing is a term particularly common with the Perl crowd that describes trying to make code as short as possible, similar to golfing where the person with the lowest number of strokes wins. But golfing tends to be more extreme and singularly focused on code length than anything discussed here.
Comment by Steve — Wednesday, 23 April 2008 @ 20:47