Regex Guru

Sunday, 2 November 2008

Detecting URLs in a Block of Text

Filed under: Regex Examples — Jan Goyvaerts @ 7:57

In his blog post The Problem with URLs, Jeff Atwood points out some of the issues with trying to detect URLs in a larger body of text using a regular expression.

The short answer is that it can’t be done. Pretty much any character is valid in URLs. The very simplistic \bhttp://\S+ fails in three ways. It cannot differentiate between punctuation that’s part of the URL and punctuation used to quote the URL. It fails to match URLs with spaces in them. Yes, spaces are valid in URLs, and I’ve encountered quite a few web sites that use them over the years. And it forgets other protocols, such as https.
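To make these failures concrete, here is a small sketch in Python. Python is my choice for illustration only, and the sample sentence and the find_naive helper are made up for this example.

```python
import re

# The simplistic pattern discussed above.
NAIVE_URL = re.compile(r"\bhttp://\S+")

def find_naive(text):
    """Return everything the simplistic pattern matches in text."""
    return [m.group(0) for m in NAIVE_URL.finditer(text)]

# Hypothetical sample text: the comma, the closing parenthesis and the
# full stop belong to the sentence, not to the URLs.
sample = "Read http://www.regexguru.com/, then (see http://example.com/page)."

for url in find_naive(sample):
    print(url)  # each match drags the trailing punctuation along

# A URL containing a space is cut off at the space.
print(find_naive("http://example.com/my file.html"))

# An https URL is not matched at all.
print(find_naive("https://example.org/"))
```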

In RegexBuddy’s library, you’ll find this regex if you look up “URL: Find in full text”:

\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]
(case insensitive)

Like every other regex for extracting URLs, it’s not perfect. The key benefit of this regex is that it uses a separate character class for the last character in the URL, which allows fewer punctuation characters than the character class for the other characters in the URL. It excludes punctuation that is unlikely to occur at the end of the URL, and more likely to be punctuation that’s part of the sentence the URL is quoted in. It does not allow parentheses at all.
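A quick sketch in Python shows the effect of the separate final character class. The regex is the one quoted above; the find_urls helper and the sample strings are assumptions made for this illustration.

```python
import re

# "URL: Find in full text" regex as quoted above, case insensitive.
LIBRARY_URL = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]",
    re.IGNORECASE,
)

def find_urls(text):
    """Return the full text of every match (group 0, not the protocol group)."""
    return [m.group(0) for m in LIBRARY_URL.finditer(text)]

# Hypothetical sample text with sentence punctuation around the URLs.
sample = "Read http://www.regexguru.com/, then (see https://example.com/page)."
print(find_urls(sample))

# Parentheses are not allowed at all, so a Wikipedia-style URL is cut short.
wiki = "http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)"
print(find_urls(wiki))
```

The final character class forces the regex to give back the comma in the first URL, and the closing parenthesis never matches in the first place.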

In EditPad Pro’s syntax coloring schemes, which are fully editable and entirely based on regular expressions, you’ll often find this regex:

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
(case insensitive)

The main difference with the previous regex is that this one matches URLs such as www.regexguru.com without the http:// protocol. People often type URLs that way in their documents and messages, because most browsers accept them that way too.
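As a rough check in Python, with a made-up sample sentence and a hypothetical find_urls helper, the regex picks up both kinds of protocol-less URLs and still leaves the sentence punctuation alone:

```python
import re

# Syntax coloring regex as quoted above, case insensitive.
EDITPAD_URL = re.compile(
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

def find_urls(text):
    """Return every URL-like match in text."""
    return [m.group(0) for m in EDITPAD_URL.finditer(text)]

# Hypothetical sample text without any http:// prefix.
sample = "Visit www.regexguru.com, or ftp.example.com."
print(find_urls(sample))
```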

EditPad’s built-in “clickable URLs” syntax highlighting uses this regex:

\b(?:(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]
   | ((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,4})\b)
|"(?:(?:https?|ftp|file)://|www\.|ftp\.)[^"\r\n]+"?
|'(?:(?:https?|ftp|file)://|www\.|ftp\.)[^'\r\n]+'?
(free-spacing, case insensitive)

This long regex adds three alternatives to the previous regex. It adds the ability to match email addresses, with or without mailto:, and it matches URLs between single or double quotes. When the URL is quoted, it allows all characters in the URL, except line breaks and the delimiting quote. This way, any URL with weird punctuation can be highlighted correctly by placing it between a pair of quote characters. Because this regex is used to highlight text as you type, the closing quotes are optional. The highlighting will run until the end of the line until you type the closing quote. Remove the question marks after the quote characters if you will use this regex to extract URLs.
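Here is a sketch of how the three extra alternatives behave, written in Python with its verbose (free-spacing) and case insensitive flags. It assumes plain ASCII quote characters as the delimiters. The find_clickable helper and the sample lines are invented for this example, and Python’s regex flavor is only standing in for EditPad’s.

```python
import re

# The "clickable URLs" regex, using straight quotes as delimiters.
# In Python's verbose mode a # inside a character class is a literal.
CLICKABLE = re.compile(
    r"""
\b(?:(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[-A-Z0-9+&@#/%=~_|$]
   | ((?:mailto:)?[A-Z0-9._%+-]+@[A-Z0-9._%-]+\.[A-Z]{2,4})\b)
|"(?:(?:https?|ftp|file)://|www\.|ftp\.)[^"\r\n]+"?
|'(?:(?:https?|ftp|file)://|www\.|ftp\.)[^'\r\n]+'?
""",
    re.VERBOSE | re.IGNORECASE,
)

def find_clickable(text):
    """Return every span the highlighter regex would mark."""
    return [m.group(0) for m in CLICKABLE.finditer(text)]

# Hypothetical line: an email address and a quoted URL with odd punctuation.
line = 'Mail jan@example.com or open "http://example.com/a b (c).html" now.'
print(find_clickable(line))

# While typing, the closing quote is still missing: the match runs to
# the end of the line because the closing quote is optional.
half_typed = 'see "www.example.com/half typed'
print(find_clickable(half_typed))
```

Note that the quoted alternatives include the quote characters in the match. That suits highlighting. For extraction you would strip them, or add a capturing group around the part between the quotes.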

So how about Jeff’s problem?

I couldn’t come up with a way for the regex alone to distinguish between URLs that legitimately end in parens (ala Wikipedia), and URLs that the user has enclosed in parens.

That’s not too hard, if we add the restriction that we only allow unnested pairs of parentheses in URLs. Using the second regex in this article as the starting point, add an alternative for a pair of parentheses to both character classes in that regex:

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])
(free-spacing, case insensitive)

This regex allows the same set of characters in the middle of the URL, mixed with zero or more sequences of those characters between parentheses. It allows the URL to end with the same reduced set of characters, or a final run between parentheses. Because we require the opening parenthesis to be in the URL, we don’t have to do anything complicated to check if any closing parentheses we encounter are part of the URL or not.
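The following Python sketch tries the regex on the two cases from Jeff’s problem. The Wikipedia-style URL, the sample sentences and the find_urls helper are assumptions made for this illustration.

```python
import re

# The regex above, in Python's verbose (free-spacing) mode, case insensitive.
PAREN_URL = re.compile(
    r"""
\b(?:(?:https?|ftp|file)://|www\.|ftp\.)
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*
  (?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])
""",
    re.VERBOSE | re.IGNORECASE,
)

def find_urls(text):
    """Return every URL-like match in text."""
    return [m.group(0) for m in PAREN_URL.finditer(text)]

# A URL that legitimately ends in a pair of parentheses.
wiki = "http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)"
print(find_urls(wiki))

# A URL that the writer enclosed in parentheses.
print(find_urls("(see http://example.com/page)"))

# Both at once: only the closing parenthesis with an opening partner
# inside the URL is kept.
print(find_urls("(see http://en.wikipedia.org/wiki/Foo_(bar))"))
```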

It’s important that you observe that in order to allow any number of pairs of parentheses in the middle of the regex, I moved the star from the character class to the group it is now in. I did not add another star to the group. A double-star combination like (a|b*)* is a sure-fire recipe for catastrophic backtracking.

All the regexes in this article will be included in RegexBuddy’s library with the next free minor update. Current version is 3.2.0.

Wednesday, 8 October 2008

R

Filed under: Regex Libraries — Jan Goyvaerts @ 17:09

R is the name of the programming language in the R project for statistical computing. It’s a bit different from the usual programming languages I work with. R primarily works on vectors, and so does its regular expression functionality. This means the new regular expression source code snippets for R in RegexBuddy 3.2.0 are a little different from those for the other languages RegexBuddy supports. But that’s a good thing. The regexp functionality in R works well with the rest of the language. The new article Regular Expressions with The R Language on regular-expressions.info has more details.

Though the regex features are tailored for R, the regex flavors are not. R supports three flavors, called basic, extended and perl in R’s documentation. I call these flavors GNU BRE, GNU ERE and PCRE. RegexBuddy supports all three, but the R source code snippets only use PCRE. PCRE is far more powerful than GNU BRE and GNU ERE. These old flavors based on the POSIX standard have the same functionality, but BRE uses a different syntax than ERE and PCRE.

Tuesday, 7 October 2008

Windows PowerShell

Filed under: Regex Libraries — Jan Goyvaerts @ 16:57

Windows PowerShell is Microsoft’s shell scripting language, based on the .NET framework. I’ve added an article to regular-expressions.info explaining how to use regular expressions with Windows PowerShell. RegexBuddy 3.2.0 now supports PowerShell strings and PowerShell source code snippets.

Since PowerShell is built on top of the .NET framework, .NET’s excellent regular expression support is also available to PowerShell programmers. PowerShell has special -match and -replace operators. These allow you to do the most common pattern matching and replacement jobs with PowerShell’s typical CmdLet and operator syntax.

For other functionality, such as splitting a string or replacing with a MatchEvaluator, you need to instantiate System.Text.RegularExpressions.Regex. One constructor of the Regex class is available through the shortcut [regex]. This constructor takes one parameter with your regex as a string. The Regex object allows you to do anything with regular expressions in PowerShell as in any other .NET language.

Tuesday, 19 August 2008

TPerlRegEx for Delphi 2009

Filed under: Regex Libraries — Jan Goyvaerts @ 12:12

TPerlRegEx is a Delphi VCL component wrapper around the open source PCRE library. I originally developed it for in-house use. It powered EditPad Pro 4 and 5, PowerGREP 1 and 2, and RegexBuddy 1. The latest versions of these products use a custom-built regular expression engine. The custom-built engine can do things such as searching through files larger than 4 GB (in PowerGREP) or emulate many regex flavors (in RegexBuddy), which aren’t typical usage scenarios for regular expressions.

The PCRE library is a great choice to add regex support to your Delphi applications. Though the library is written in C, Delphi can link the OBJ files output by a C compiler into your application. TPerlRegEx includes ready-made OBJ files, so you don’t have to worry about any of this.

The latest version of TPerlRegEx includes PCRE 7.7, with Unicode support enabled. To actually use the Unicode features, you’ll need Delphi 2009. In PerlRegEx.pas, you’ll see the following near the top of the unit:

{$IFDEF UNICODE}
type
  PCREString = UTF8String;
{$ELSE}
type
  PCREString = AnsiString;
{$ENDIF}

The UNICODE directive is defined by default in Delphi 2009, but not in Delphi 2007 or earlier. If you’re using Delphi 2007 or before, TPerlRegEx will work with AnsiString, which has been the default string type since Delphi 2.

When you migrate your application to Delphi 2009, the default string type becomes UnicodeString. PCRE does not support UTF-16. It only supports 8-bit strings (i.e. one byte per character), and UTF-8. Hence my decision to make TPerlRegEx use the new and improved UTF8String in Delphi 2009.

When you assign a variable declared as “string”, which really is UnicodeString in Delphi 2009, to a property such as TPerlRegEx.Subject, then the Delphi 2009 compiler will automatically do the UTF-16 to UTF-8 conversion for you. When you assign a property such as TPerlRegEx.MatchedExpression to a string variable, the UTF-8 to UTF-16 conversion is also automatic. The net result is that when you use TPerlRegEx and you upgrade from Delphi 2007 to Delphi 2009, your regular expressions automatically become Unicode-enabled.

The only caveat lies in position properties such as MatchedExpressionOffset and MatchedExpressionLength. These indicate byte positions in the UTF-8 strings that TPerlRegEx deals with. To make your code work correctly in Delphi 2009, use those positions with the TPerlRegEx.Subject property (which uses UTF-8) instead of your original string variable (which uses UTF-16 in Delphi 2009).
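The mismatch is easy to see outside Delphi too. The sketch below uses Python rather than Delphi, with zero-based offsets, a made-up subject string and a hypothetical helper named utf8_span_to_utf16. It is not part of TPerlRegEx. It only illustrates why a byte position in the UTF-8 string cannot be used directly as an index into the UTF-16 string.

```python
# Hypothetical subject string with two non-ASCII characters.
subject = "naïve café"
utf8 = subject.encode("utf-8")

# Zero-based byte offset and byte length of "café" in the UTF-8 data,
# comparable to what a UTF-8 regex engine would report for a match.
needle = "café".encode("utf-8")
byte_offset = utf8.find(needle)
byte_length = len(needle)

def utf8_span_to_utf16(data, offset, length):
    """Convert a UTF-8 byte span into an offset and length in UTF-16 code units."""
    before = data[:offset].decode("utf-8")
    match = data[offset:offset + length].decode("utf-8")
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte order mark.
    return (len(before.encode("utf-16-le")) // 2,
            len(match.encode("utf-16-le")) // 2)

unit_offset, unit_length = utf8_span_to_utf16(utf8, byte_offset, byte_length)

print(byte_offset, byte_length)   # span measured in UTF-8 bytes
print(unit_offset, unit_length)   # same span in UTF-16 code units

# Slicing the UTF-8 data with the byte span gives the right text.
print(utf8[byte_offset:byte_offset + byte_length].decode("utf-8"))

# Applying the byte span to the original string gives the wrong text.
# (Python indexes by code point, which equals UTF-16 units for this string.)
print(subject[byte_offset:byte_offset + byte_length])
```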

Note that in Delphi 2007, there’s no difference between UTF8String and AnsiString. Manually defining the UNICODE directive in Delphi 2007 will not make TPerlRegEx support Unicode. You’d have to add explicit calls to UTF8Encode and UTF8Decode to do the conversions.

All in all, porting TPerlRegEx from Delphi 2007 to 2009 was very easy. Changing the string declaration from AnsiString to UTF8String, and passing the PCRE_UTF8 flag to the pcre_compile function is all it really took. I wasted much more time upgrading the OBJ files from PCRE 4.5 to 7.7, which I ended up borrowing from the JCL project.

Download TPerlRegEx. Source is included under the MPL 1.1 license.
