Regex Guru

Tuesday, 15 April 2008

Watch Out for Zero-Length Matches

Filed under: Regex Trouble — Jan Goyvaerts @ 14:51

A zero-width or zero-length match is a regular expression match that does not match any characters. It matches only a position in the string. E.g. the regex \b matches between the 1 and , in 1,2.

Zero-length matches are often an unintended result of mistakenly making everything optional in a regular expression. Such a regular expression will in fact find a zero-length match at every position in the string where it cannot match anything longer. My floating point example has long shown this.
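To make this concrete, here is a small sketch. The pattern is an assumption written in the spirit of the floating point example, not a quotation from it: every token in it is optional, so it can succeed without consuming anything. String.match() with a global regex returns all matches, including the empty ones.

```javascript
// Assumed pattern: optional sign, optional digits, optional dot, optional digits.
var optionalFloat = /[-+]?[0-9]*\.?[0-9]*/g;

// No digits anywhere: one empty match at each of the 4 positions in "abc".
var matchesInWord = "abc".match(optionalFloat);

// The real number is found, but it is surrounded by empty matches.
var matchesInMixed = "a1.5b".match(optionalFloat);

console.log(matchesInWord);  // [ '', '', '', '' ]
console.log(matchesInMixed); // [ '', '1.5', '', '' ]
```

Requiring at least one digit, for instance by making the integer part mandatory, is the usual way to get rid of the empty matches.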

Apparently, JavaScript developers have it particularly tough. Different browsers handle zero-length matches differently. Steven Levithan argues that IE has a bug because it increments lastIndex. Steven’s observation is correct. When iterating over /\b/g.exec(), regex.lastIndex = match.index + 1 in Internet Explorer, while in other browsers they’re equal. So who’s got it wrong?

The ECMA-262 v3 standard defines the lastIndex property in 15.10.7.5 as:

The value of the lastIndex property is an integer that specifies the string position at which to start the next match.

It’s easy enough to understand this in the context where the developer sets lastIndex prior to calling exec() to make the match attempt start at a certain position. But how should the exec() method set lastIndex after a successful match?

For String.match() the standard says in 15.5.4.10:

If there is a match with an empty string (in other words, if the value of regexp.lastIndex is left unchanged), increment regexp.lastIndex by 1.

For String.replace() the standard says in 15.5.4.11:

Do the search in the same manner as in String.match(), including the update of searchValue.lastIndex.

But for RegExp.exec() the standard says in 15.10.6.2:

Let e be r’s endIndex value [i.e. the end of the match]. If the global property is true, set lastIndex to e.

The standard contradicts itself. 15.10.6.2 is inconsistent with the three other definitions, in that it omits the +1 in case of a zero-width match.

My opinion, though, is that Internet Explorer got it right, and that browsers that implement 15.10.6.2 as written while ignoring the definition in 15.10.7.5 got it wrong. The omission of the lastIndex++ for regex.exec() looks to me like an oversight by the standards writers rather than something they did intentionally. The reason is that every regex engine that I know of works the way Internet Explorer does. It’s the only way to avoid getting stuck in an infinite loop, as Firefox does.

If a zero-width match is found, the next match attempt begins one character further ahead in the string. After \b matches between the 1 and , in 1,2, the next match attempt will begin at the position between the , and the 2 (and match there), rather than staying stuck forever.

I do understand where the confusion comes from. The property is called lastIndex, but the standard defines it as something that should be called nextAttempt. lastIndex is not the end of the previous match. The ECMA-262 standard does not provide a property for that. To get that you have to add up match.index and match[0].length yourself.
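As a minimal sketch of that calculation, the snippet below computes the end of a match by hand. The variable names digits, sample, found and endOfMatch are made up for this illustration.

```javascript
var digits = /\d+/g;
var sample = "ab123cd";

var found = digits.exec(sample);

// End of the match = start of the match + length of the matched text.
var endOfMatch = found.index + found[0].length; // 2 + 3

console.log(found[0], found.index, endOfMatch); // 123 2 5

// For a non-empty match, lastIndex happens to equal the end of the match.
// After a zero-length match the two may differ, depending on the browser,
// so compute the end yourself when you need it.
console.log(digits.lastIndex); // 5
```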

Here’s my solution to the browser compatibility problem:

var match;
// regex must have the g flag, or exec() restarts at 0 every time
while ((match = regex.exec(subject)) !== null) {
  // Prevent browsers like Firefox from getting stuck in an infinite loop
  if (match.index === regex.lastIndex) regex.lastIndex++;
  // Do whatever you want with the match
  var start_of_match = match.index;
  var length_of_match = match[0].length;
  var first_character_after_match = start_of_match + length_of_match;
}

This code is easy to understand, and only uses one extra line (plus a comment) to work around the browser problems.
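For a quick check, the loop can be wrapped in a small helper and run against the 1,2 example from earlier. This is only a sketch: matchPositions is a hypothetical name, and the helper assumes the regex was created with the g flag (without it, exec() never advances and the loop would not end).

```javascript
// Hypothetical helper: returns the start position of every match of a global regex.
function matchPositions(regex, subject) {
  var positions = [];
  var match;
  regex.lastIndex = 0; // always start at the beginning of the subject
  while ((match = regex.exec(subject)) !== null) {
    // Step past a zero-length match in engines that leave lastIndex at match.index
    if (match.index === regex.lastIndex) regex.lastIndex++;
    positions.push(match.index);
  }
  return positions;
}

// \b matches before the 1, on either side of the comma, and after the 2.
var boundaries = matchPositions(/\b/g, "1,2");
console.log(boundaries); // [ 0, 1, 2, 3 ]
```

In an engine that already increments lastIndex after a zero-length match, the guard simply never fires, so the result is the same either way.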

8 Comments

  1. Jan, thanks for your thoughtful and detailed follow-up. It’s gotten me to think more about the issue, and while I would agree that ECMAScript’s use of lastIndex is poorly designed, it doesn’t technically contradict itself. See my response to your comment on my blog.

    Comment by Steve — Wednesday, 16 April 2008 @ 10:45

  2. Why use interpositions? Are there any benefits from zero-length Regex matches, beyond delimitation of genuine matches? Basically, why hasn’t this nonsense been deprecated and avoided by now – what fundamental reason have I missed?

    Other examples of it include :
    1) Java substring() indexing
    2) “interbase” nucleic/protein seq indexing

    The same confusions arise time and time again, and it costs us all a fortune.
    Microsoft dodged this issue in C# substring indexing by making the second arg the actual length of the required substring, essentially shielding the coder from absurd internal implementation. I wish this was more widespread.

    Comment by Nigel F — Tuesday, 13 October 2009 @ 0:14

  3. A common use of zero-length matches is using ^ (start of line) and $ (end of line) by themselves in a search-and-replace to insert something at the start or the end of each line.

    In JavaScript too you get the position and length of the match. The lastIndex property is supposed to be an implementation detail that unfortunately surfaces because the implementations don’t agree. If there were many ECMA-334 (C#) implementations like there are many ECMA-262 (JavaScript) implementations, there would be similar compatibility problems.

    Comment by Jan Goyvaerts — Tuesday, 13 October 2009 @ 15:50

  4. Ok. I think I’ve got my head around this now. Thanks.
    Someone else summarised it with “interpositioning saves having to work out whether you need to incr or decr your index/length value, which can be a chore.”
    I think my problem with all this is the unintuitive terminology, and my growing conviction that in modern codebases, there’s no benefit anymore. I hope we can alias all this confusion away with 1-based indexing across the board……

    Comment by Nigel F — Wednesday, 14 October 2009 @ 2:01

  5. […] This comes into play, e.g., when using a regex to iterate over all matches in a string. However, the fact that lastIndex is actually set to the end position of the last match rather than the position where the next search should start (unlike equivalents in practically all programming languages) causes a problem after zero-length matches, which are easily possible with regexes like /\w*/g or /^/mg. Hence, you’re forced to manually increment lastIndex in such cases. I’ve posted about this issue in more detail before (see: An IE lastIndex Bug with Zero-Length Regex Matches), as has Jan Goyvaerts (Watch Out for Zero-Length Matches). […]

    Pingback by What the JavaScript RegExp API Got Wrong, & How to Fix It — Monday, 1 March 2010 @ 17:47

  6. certainly start of line and end of line are important functions.

    what is the point of zero length matches with wildcards though? maybe they have a role in search and replace. since a zero length match can lead to insertion of text.

    Perhaps there should be 2 modes of regex: one for checking that the text is of a certain form and capturing certain info, e.g. seeing that there’s an integer followed by a hyphen, etc., and capturing the indexes of these; and another mode of regex for search and replace.

    Comment by robert — Sunday, 23 May 2010 @ 23:59

  7. […] According to Steven, IE’s implementation is a bug (http://blog.stevenlevithan.com/archives/exec-bugs). In Jan’s mind, IE actually did what was the right thing to do and changed the feature to work similarly to how the other RegEx methods work (although, as usual, IE isn’t following the standard as set out by the ECMAScript specification – http://www.regexguru.com/2008/04/watch-out-for-zero-length-matches/). […]

    Pingback by JavaScript’s Regular Expression exec/lastIndex bug? « Integralist — Monday, 7 June 2010 @ 3:23

  8. […] (mdn), non word is obviously everything else! The trick for word boundaries is that they are zero width assertions, which means they don’t count as a character. That’s why /\w\b\w/ will never match, […]

    Pingback by Javascript Regular Expression | Epic Zoe — Wednesday, 8 May 2013 @ 13:15
