Regex Trouble
Regex Trouble is the category on this blog where I’ll discuss various mistakes people tend to make with regular expressions.
A few months ago I added a “Pitfalls” section to the regular-expressions.info web site. It sits a bit tucked away in the examples section, so you might have missed it. There are four articles in this section thus far.
Catastrophic Backtracking: This is my term for runaway regular expressions that have so many possible permutations that the regular expression engine needs more time than the estimated life expectancy of the universe to try them all, unless it runs out of stack space first and crashes your whole application. Even if the engine is smart enough to detect the problem and abort the match attempt, you’re still not getting the matches you wanted. If you’re addicted to the lazy dot .*?, this article is a must-read.
Making Everything Optional: If everything in your regex is optional, you’ll end up with zero-width matches all over the place. For example, in a floating-point number, the integer part, the dot, and the fraction are each optional, but not all at the same time! You have to specify that in your regex.
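Repeated Capturing Group: a(bc)+d is not the same as a((bc)+)d when it comes to $1 in the replacement text. In the first regex, $1 holds only the last bc that was matched; in the second, it holds the whole run of repetitions.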
Mixing Unicode and 8-bit Character Codes: There’s life beyond ASCII. Make sure you know the difference between \u0080 and \x80.
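To make the four pitfalls a bit more concrete, here is a minimal sketch in Python. It is not taken from the articles themselves: the choice of Python's re module, the sample strings, and the variable names are all assumptions for illustration, and other regex flavors differ in the details (Python writes \1 where the text above writes $1, for instance). The backtracking subject is deliberately short so the script finishes quickly; making it longer is what sends the nested pattern toward runaway behavior.

```python
import re
import time

# 1. Catastrophic backtracking: nested quantifiers on a subject that cannot match.
#    The subject is kept short on purpose; the work grows steeply with its length.
subject = "x" * 16
start = time.perf_counter()
nested_result = re.search(r"(x+x+)+y", subject)
nested_seconds = time.perf_counter() - start

start = time.perf_counter()
flat_result = re.search(r"xx+y", subject)  # accepts the same strings, no nesting
flat_seconds = time.perf_counter() - start

# 2. Making everything optional: this pattern also matches the empty string,
#    so findall() returns zero-width matches between the real numbers.
all_optional = re.compile(r"[-+]?[0-9]*\.?[0-9]*")
# One possible fix (an assumption, not the article's regex): require a digit.
stricter = re.compile(r"[-+]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)")

text = "x = 3.14, y = .5"
loose_matches = all_optional.findall(text)
strict_matches = stricter.findall(text)

# 3. Repeated capturing group: group 1 keeps only its last iteration
#    unless the repetition itself is wrapped in a group.
last_iteration = re.sub(r"a(bc)+d", r"[\1]", "abcbcbcd")
whole_run = re.sub(r"a((bc)+)d", r"[\1]", "abcbcbcd")

# 4. Unicode versus 8-bit codes: in Windows-1252 the euro sign is byte 0x80,
#    but in Unicode it is U+20AC, and U+0080 is an unrelated control character.
euro = "\u20ac"
euro_bytes = euro.encode("cp1252")
unicode_hit = re.search("\u0080", euro)      # code point U+0080 against text
byte_hit = re.search(rb"\x80", euro_bytes)   # byte 0x80 against encoded bytes

if __name__ == "__main__":
    print("nested:", nested_result, f"{nested_seconds:.4f}s")
    print("flat:  ", flat_result, f"{flat_seconds:.4f}s")
    print("loose: ", loose_matches)
    print("strict:", strict_matches)
    print("last iteration:", last_iteration, "| whole run:", whole_run)
    print("U+0080 found in text:", unicode_hit is not None)
    print("0x80 found in cp1252 bytes:", byte_hit is not None)
```

Both backtracking patterns fail on this subject, which is the point: the difference between them only shows up in how long they take to fail, and the timings will vary by machine.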
Yes, catastrophic backtracking (or even not “catastrophic” but just highly inefficient) is a major problem. I find the biggest issue is that people not only stick a .* in willy-nilly, but also think only about the successful match case, and not about what the engine will do if the pattern _doesn’t_ match.
Anyway, keep up the good work!
Comment by Hadriel — Saturday, 29 March 2008 @ 3:25