Regex Guru

Monday, 27 July 2009

Magically Generating Regular Expressions

Filed under: The Guru's Kitchen — Jan Goyvaerts @ 14:23

A lot of people are looking for a program that can automatically generate regular expressions for them. The program would only take examples of valid matches as input, and produce the proper regular expression as output, inferring the user’s idea of “proper” as by magic. It is certainly one of the more frequent feature request I receive for RegexBuddy. Unfortunately, no computer program will ever be able to generate a meaningful regular expression based purely on a list of valid matches. Let me show you why.

Suppose you provide the examples 111111 and 999999. Which regular expression should the computer generate?

  1. A regex matching exactly those two examples: (?:111111|999999)
  2. A regex matching 6 identical digits (\d)\1{5}
  3. A regex matching 6 ones and nines [19]{6}
  4. A regex matching any 6 digits \d{6}
  5. Any of the above four, with word boundaries, e.g. \b\d{6}\b
  6. Any of the first four, not preceded or followed by a digit, e.g. (?<!\d)\d{6}(?!\d)

As you can see, there are many ways in which examples can be generalized into a regular expression. The only way for the computer to build a predictable regular expression is to require you to list all possible matches. Then it could generate a search pattern that matches exactly those matches, and nothing else. Usually, providing an exhaustive list of matches is exactly what we’re trying to avoid. And when you do have an exhaustive list of all possible matches, an optimized plain text search processing the whole list at once will be as fast as or faster than a regex search. The plain text search can be optimized to scan the text only once, without backtracking like regular expressions do

If you don’t want to list all possible matches, you need a higher-level description. Instead of providing a long list of 6-digit numbers, you simply tell the program to match “any six digits”. The regular expression syntax itself is one way to provide such a description. Regular expressions are powerful enough that they can describe any text that doesn’t depend on its context. “Any six digits” is written as \d{6} in regular expression syntax.

To make the higher-level description easy to work with, it needs domain knowledge. Matching a date between January 1st and March 31st is much easier if your tool or language knows what a date is. This is where regular expressions fall short. Regular expressions only know about characters. Essentially, a regular expression describes which character comes next, or which characters are allowed next.

This wouldn’t be the regex guru blog if there story ended here. Though pure auto-generation of regexes isn’t feasible, there have to be better ways of creating text patterns than spelling things out character by character. I don’t mean trying to make regular expressions more accessible by coming up with a more verbose syntax, as some people have tried. Microsoft would be pushing COBOL# instead of C# if wordiness was key.

I mean a tool that works at a higher level. Something that allows you to say “a number between 1 and 12″ instead of “a digit between 1 and 9, or a digit 1 followed by a digit between 0 and 2″. That’s what I’ve been working on for the past two years. It’s the reason why there haven’t been any major Just Great Software releases since 2007.

RegexMagic is a brand new product from Just Great Software. Its purpose is to make it easier to create regular expressions, but without using the regular expression syntax. You can tell RegexMagic you want a date between January 1st and March 31st, and that you want it in yyyy-mm-dd format, simply by selecting the “date and time” pattern and setting its options. Once you’ve done that, RegexMagic magically spits out your regular expression. Version 1.0 has patterns for characters, numbers, dates, times, email addresses, URLs, credit card numbers, national IDs, and more.

In practice, most of the regular expressions you want won’t neatly fit into one of RegexMagic’s predefined patterns. If you mark 1.2.12 as a whole, RegexMagic will guess it’s a date (1 February 2012, German style), rather than a product version number. If want a regex that matches 3 numbers delimited by dots, mark the numbers and the dots separately into 5 fields. Select the integer pattern for the numbers and the literal text pattern for the dots. Then RegexMagic can again magically spit out your regex, even if it didn’t magically read your mind.

Once you have created your higher-level description in RegexMagic, which is called a RegexMagic formula, editing and customizing that description is trivial compared to editing a regex. If you decide that the parts of a version should be restricted to values between 1 and 255, simply set the limits on the integer patterns, and regenerate the regular expression.

Just like RegexBuddy, RegexMagic supports nearly all popular regular expression flavors. Select your flavor, and RegexMagic makes sure to generate a regular expression that works with it. RegexMagic can also generate snippets in many programming languages that you can copy and paste directly into your source code to implement your regular expression. RegexMagic uses the same source code templates as RegexBuddy.

Though RegexMagic is mainly aimed at people new to regular expressions, experts on regular expressions may still find RegexMagic useful to quickly generate more complicated regexes. E.g. if you need to match numbers between 256 and 512, and all you can use is a single regex, it takes less than a minute to make RegexMagic spit out 51[0-2]|50[0-9]|[34][0-9]{2}|2[6-9][0-9]|25[6-9]. Though such a regex is conceptually very simple, it is quite a chore to build up by hand.

For more details, a trial download, and an introduction discount, please visit RegexMagic’s web site.

Tuesday, 18 March 2008

If You Do It Differently, Document It Clearly

Filed under: The Guru's Kitchen — Jan Goyvaerts @ 18:19

Earlier today during development, I was writing some code that deals with mode modifiers. Most modern regex flavors use the (?i), (?s), (?m) and (?x) modifiers first used in Perl. Though the s and m modes are misnamed, at least they’re easy enough to remember once you get the hang of it.

Tcl’s ARE engine, however, tried to improve the situation. Instead of a “single line” and a “multi line” option that can both be on or off, yielding 4 states, Tcl uses the terms “non-newline-sensitive”, “partial newline-senstive”, “inverse partial newline-sensitive” (a.k.a. “weird”) and “newline-sensitive” for each of the 4 combinations, and four letters to go with the 4 names. The defaults are also different.

I can never remember Tcl’s matching modes. I don’t use Tcl other than for testing its regex engine. So I checked my own documentation on the subject. And I found I was contradicting myself. What I wrote in the bullet points contradicted other bullet points, and the comparison table with Perl further down the page. Turns out the (?w) and (?n) bullet points and table items were all wrong, in different ways.

To figure this out I consulted the official Tcl docs once more:

If newline-sensitive matching is specified, . and bracket expressions using ^ will never match the newline character (so that matches will never cross newlines unless the RE explicitly arranges it) and ^ and $ will match the empty string after and before a newline respectively, in addition to matching at beginning and end of string respectively. ARE \A and \Z continue to match beginning or end of string only.

If partial newline-sensitive matching is specified, this affects . and bracket expressions as with newline-sensitive matching, but not ^ and `$’.

If inverse partial newline-sensitive matching is specified, this affects ^ and $ as with newline-sensitive matching, but not . and bracket expressions. This isn’t very useful but is provided for symmetry.

I don’t know about you, but the above makes little sense to me. Testing Tcl’s engine again, it’s actually technically correct. Just hard to understand when explained like this. RegexBuddy does get its explanation right on the Create tab.

It doesn’t matter if Perl’s or Tcl’s way of specifying what RegexBuddy calls “dot matches newlines” and “^ and $ match at line breaks” is better. Perl’s the established way, and Tcl thinks it can do better. But Tcl then does a poor job of explaining its improvements, which only leads to confusion.

If you’re improving on established standards, make sure to explain yourself clearly. People are used the old ways, and will resist change, particularly if you make change difficult with poor documentation.

So what’s my opinion on “dot matches newlines” and “^ and $ match at line breaks”? The latter is obsolete. Perl, Tcl and most flavors that follow Perl, have \A and \Z to match the start and end of a subject. So redefining ^ and $ to match at embedded line breaks is fair game. In EditPad Pro and PowerGREP, “^ and $ match at line breaks” is permanently enabled, though you could put (?-m) at the start of your regex if you must. The “dot matches newlines” option is still useful, because doing the same with character classes is cumbersome. What Tcl’s docs call “weird” and “not very useful” is actually quite handy when dealing with data spread over multiple lines in a larger file (i.e. turning on both “dot matches newlines” and “^ and $ match at line breaks”).