Regex Guru

Monday, 27 April 2009

Split() is Not Always The Best Way to Split a String

Filed under: Regex Trouble — Jan Goyvaerts @ 16:17

Most programming languages have a split() function that takes a regular expression and a string. It returns an array with the parts of the string between the regex matches. The regex matches themselves are discarded.

The split() function is great when it’s easy to write a regular expression to match the delimiters. You can easily split a string along commas: split(/,/, subject). In fact, you don’t even need a regex for this. Splitting on an (X)HTML break tag does require a regex, but is still very easy: <br\s*/?>.
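As a rough illustration of these two easy cases, here is a minimal sketch in Python, chosen only as an example language. The sample strings and variable names are made up for the demonstration, and the break-tag pattern is written as <br\s*/?> so that it accepts <br>, <br/> and <br />.

```python
import re

# A fixed delimiter such as a comma needs no regex at all.
simple = "a,b,c".split(",")

# A break tag can be written several ways, so a regex is the easy route here.
html = "one<br>two<br/>three<br />four"
parts = re.split(r"<br\s*/?>", html)

print(simple)  # ['a', 'b', 'c']
print(parts)   # ['one', 'two', 'three', 'four']
```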

The split() function is terrible when the delimiters can occur in the split content. A common job is to split a string along commas, except when those commas appear in double quotes. Such a string might be a line in a CSV file. Every now and then on the RegexBuddy user forum somebody asks for help with fixing some hideous contraption of lookahead and lookbehind that is intended to match an unquoted comma.

In such cases, it is much easier to write a regex that matches the content you want to keep in the array, and use findall() instead of split(). Instead of writing a regex that matches what you want split() to throw away, you write a regex that matches what findall() should keep. While writing a regex to match unquoted commas is hard, writing one for CSV fields is quite straightforward:

"[^"]*"|[^,]+

This regex matches a pair of double quotes with anything except double quotes between them, or a series of characters that don’t include a comma. Those are the two forms that fields can use in CSV files. If you want to allow escaped quotes in quoted fields, this regex is a bit longer but very efficient:

"[^"\\]*(?:\\.[^"\\]*)*"|[^,]+

Both these regular expressions assume that a field that begins with a double quote also ends with one.
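To make the contrast concrete, here is a small sketch using Python's re.findall() with both regexes. The sample lines and the names SIMPLE_FIELD and ESCAPED_FIELD are invented for illustration. Note that the quotes stay part of each matched field, and that [^,]+ needs at least one character, so an empty field is skipped rather than returned as an empty string.

```python
import re

# The two field-matching regexes from the text.
SIMPLE_FIELD = re.compile(r'"[^"]*"|[^,]+')
ESCAPED_FIELD = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"|[^,]+')

line = 'one,"two, with comma",three'

# Splitting on the delimiter cuts the quoted field in half.
naive = line.split(",")

# Matching the fields keeps the quoted comma where it belongs.
fields = SIMPLE_FIELD.findall(line)

# Backslash-escaped quotes inside a quoted field need the longer regex.
escaped_line = r'1,"say \"hi\", then leave",3'
escaped_fields = ESCAPED_FIELD.findall(escaped_line)

# An empty leading field is not reported at all.
with_empty = SIMPLE_FIELD.findall(',"TEST"')

print(naive)       # ['one', '"two', ' with comma"', 'three']
print(fields)      # ['one', '"two, with comma"', 'three']
print(with_empty)  # ['"TEST"']
```

If empty fields matter, or the data comes from a real CSV file, a dedicated parser such as Python's csv module is likely the safer choice.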

Another question I got recently was how to split text on semicolons, except for semicolons that are part of an HTML entity such as &amp;, using a regex flavor that does not support lookbehind. Again, the easiest way to solve this problem is to turn it on its head. Instead of figuring out a way to exclude semicolons that are part of an entity without using lookbehind, just write a regex that matches text with entities:

[^&;]+(?:&\w+;[^&;]*)*

This regular expression assumes that ampersands only occur in the text as part of entities. It matches any text excluding ampersands and semicolons, optionally followed by an entity and more text without ampersands and semicolons.
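Here is a minimal sketch of the same idea in Python, again with a made-up sample string. It uses the pattern with the entity group repeated zero or more times, as the description calls for, so that a piece may hold no entity, one entity, or several.

```python
import re

# Text without ampersands or semicolons, followed by any number of
# "entity plus more text" groups. No lookbehind is needed.
PIECE = re.compile(r"[^&;]+(?:&\w+;[^&;]*)*")

text = "Tom &amp; Jerry;Fish &amp; Chips;plain"
pieces = PIECE.findall(text)

print(pieces)  # ['Tom &amp; Jerry', 'Fish &amp; Chips', 'plain']
```

Because the group is non-capturing, findall() returns the whole match for each piece rather than the contents of the group.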

2 Comments

  1. "[^"\\]*(?:\\.[^"\\]*)*"|[^,]+

    doesn’t work if you have an empty comma, for example

    ID,NAME
    ,"TEST"

    will only return "TEST"

    Comment by Christian — Saturday, 4 June 2011 @ 19:06

  2. @Christian … He said that in the summary …

    "[^"\\]*(?:\\.[^"\\]*)*"|[^,]+

    Both these regular expressions assume that a field that begins with a double quote also ends with one.

    Comment by Edward Beckett — Friday, 7 September 2012 @ 6:05
