Regex Guru

Tuesday, 18 March 2008

If You Do It Differently, Document It Clearly

Filed under: The Guru's Kitchen — Jan Goyvaerts @ 18:19

Earlier today during development, I was writing some code that deals with mode modifiers. Most modern regex flavors use the (?i), (?s), (?m) and (?x) modifiers first used in Perl. Though the s and m modes are misnamed, at least they’re easy enough to remember once you get the hang of it.
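For readers who haven’t used these modifiers, here is a minimal sketch of what each one does. It uses Python’s re module, which follows the Perl conventions, rather than Perl itself. The sample text and the matches helper are made up for illustration.

```python
import re

# Hypothetical two-line subject string, used only for this illustration.
TEXT = "Alpha\nbeta"

def matches(pattern):
    """Return the text matched by pattern in TEXT, or None if there is no match."""
    m = re.search(pattern, TEXT)
    return m.group(0) if m else None

examples = [
    r"(?i)ALPHA",                     # (?i): ignore case
    r"a.b",                           # no modifier: the dot stops at the newline, so no match
    r"(?s)a.b",                       # (?s) "single line": the dot also matches the newline
    r"^beta",                         # no modifier: ^ only matches at the start of the string
    r"(?m)^beta",                     # (?m) "multi line": ^ also matches after a line break
    r"(?x) b e t a  # free-spacing",  # (?x): whitespace and comments in the regex are ignored
]

for pattern in examples:
    print(repr(pattern), "->", repr(matches(pattern)))
```

The names are misleading because (?s) has nothing to do with treating the subject as a single line; it only changes what the dot matches. Likewise, (?m) only changes where ^ and $ can match.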

Tcl’s ARE engine, however, tried to improve the situation. Instead of a “single line” and a “multi line” option that can each be on or off, yielding four states, Tcl uses the terms “non-newline-sensitive”, “partial newline-sensitive”, “inverse partial newline-sensitive” (a.k.a. “weird”) and “newline-sensitive” for the four combinations, with four letters to go with the four names. The defaults are also different.

I can never remember Tcl’s matching modes. I don’t use Tcl other than for testing its regex engine. So I checked my own documentation on the subject. And I found I was contradicting myself. Some of my bullet points contradicted other bullet points, as well as the comparison table with Perl further down the page. Turns out the (?w) and (?n) bullet points and table items were all wrong, in different ways.

To figure this out I consulted the official Tcl docs once more:

If newline-sensitive matching is specified, . and bracket expressions using ^ will never match the newline character (so that matches will never cross newlines unless the RE explicitly arranges it) and ^ and $ will match the empty string after and before a newline respectively, in addition to matching at beginning and end of string respectively. ARE \A and \Z continue to match beginning or end of string only.

If partial newline-sensitive matching is specified, this affects . and bracket expressions as with newline-sensitive matching, but not ^ and $.

If inverse partial newline-sensitive matching is specified, this affects ^ and $ as with newline-sensitive matching, but not . and bracket expressions. This isn’t very useful but is provided for symmetry.

I don’t know about you, but the above makes little sense to me. After testing Tcl’s engine again, I found that the description is technically correct. It’s just hard to understand when explained like this. RegexBuddy does get its explanation right on the Create tab.
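Reduced to two on/off switches, the four Tcl modes are easier to keep apart. The following is a minimal sketch in Python, not Tcl: it maps each Tcl mode name onto Python’s re.DOTALL and re.MULTILINE flags, based on my reading of the Tcl documentation quoted above. The MODES table, the helper names and the sample string are assumptions made for illustration. Python’s negated character classes always match newlines, so the sketch does not model Tcl’s treatment of bracket expressions using ^.

```python
import re

# Hypothetical two-line subject string.
SUBJECT = "first\nsecond"

# Tcl mode name -> (dot matches newlines, ^ and $ match at line breaks).
# This mapping is an interpretation of the Tcl docs, not something Tcl exposes.
MODES = {
    "non-newline-sensitive": (True, False),             # Tcl's default
    "partial newline-sensitive": (False, False),        # same as Perl's default
    "inverse partial newline-sensitive": (True, True),  # a.k.a. "weird"
    "newline-sensitive": (False, True),
}

def flags_for(dot_matches_newlines, anchors_match_at_line_breaks):
    """Translate the two switches into Python regex flags."""
    flags = 0
    if dot_matches_newlines:
        flags |= re.DOTALL
    if anchors_match_at_line_breaks:
        flags |= re.MULTILINE
    return flags

def demo(mode):
    """Return (can the dot cross the newline?, can ^ match at the second line?)."""
    flags = flags_for(*MODES[mode])
    dot_crosses = re.search(r"t.s", SUBJECT, flags) is not None
    caret_at_line = re.search(r"^second", SUBJECT, flags) is not None
    return dot_crosses, caret_at_line

for mode in MODES:
    print(f"{mode:36} {demo(mode)}")
```

Each mode’s observed behavior comes out identical to its pair of switches, which is the whole point: two booleans describe all four modes.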

It doesn’t matter if Perl’s or Tcl’s way of specifying what RegexBuddy calls “dot matches newlines” and “^ and $ match at line breaks” is better. Perl’s the established way, and Tcl thinks it can do better. But Tcl then does a poor job of explaining its improvements, which only leads to confusion.

If you’re improving on established standards, make sure to explain yourself clearly. People are used to the old ways, and will resist change, particularly if you make change difficult with poor documentation.

So what’s my opinion on “dot matches newlines” and “^ and $ match at line breaks”? The latter is obsolete as an option. Perl, Tcl and most flavors that follow Perl have \A and \Z to match the start and end of a subject. So redefining ^ and $ to match at embedded line breaks is fair game. In EditPad Pro and PowerGREP, “^ and $ match at line breaks” is permanently enabled, though you could put (?-m) at the start of your regex if you must. The “dot matches newlines” option is still useful, because doing the same with character classes is cumbersome. What Tcl’s docs call “weird” and “not very useful”, i.e. turning on both “dot matches newlines” and “^ and $ match at line breaks”, is actually quite handy when dealing with data spread over multiple lines in a larger file.
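To illustrate why having both options on is handy, here is a small sketch in Python, where (?sm) corresponds to what Tcl calls “weird”. The record layout, the LOG string and the variable names are invented for the example. Note that Python’s \Z matches only at the very end of the string, which differs slightly from Perl’s, so the sketch only demonstrates \A.

```python
import re

# Hypothetical file content: records whose body may span several lines.
LOG = "id: 1\nbody: one\ntwo\nend\nid: 2\nbody: three\nend\n"

# (?m) lets ^ and $ anchor the "body:" and "end" marker lines;
# (?s) lets .*? run across the line breaks in between.
RECORD = re.compile(r"(?sm)^body: (.*?)^end$")
bodies = RECORD.findall(LOG)

# With (?m) on, ^ matches at every line start, while \A still matches
# only at the start of the whole subject.
line_starts = len(re.findall(r"(?m)^id:", LOG))
string_starts = len(re.findall(r"(?m)\Aid:", LOG))

print(bodies)
print(line_starts, string_starts)
```

Doing the same without “dot matches newlines” would mean replacing the dot with something like [\s\S], which is the cumbersome character class workaround mentioned above.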
