Extreme regex foo: what you need to know to become a regular expression pro
Programming, Tutorials. June 21st, 2007
This tutorial is intended for advanced audiences. If you’re new to regular expressions, or if you could use a quick refresher, go read my intro to regular expressions, and work through a few examples. Trust me, it’ll be one of the most rewarding twenty minutes you’ve ever spent. If you’re familiar with the basic regex concepts, then read on and learn all you need to know to be a regex pro.
What you should know
This tutorial will go beyond the basics of regular expressions. You should already know what the metacharacters are (^[](){}.*?\|+$ and sometimes -) and how to use them, understand how to use parentheses for grouping and capturing, and be able to construct basic regular expressions to match things like email addresses and URLs. If you’re lost, go read the articles I mentioned in the first paragraph, then come back here.
You should also be aware that many advanced regex features work differently (or aren’t available at all) depending on the flavor of regular expression you’re using. I’ll be using Perl-compatible regular expression syntax (often loosely called PCRE, after the library of that name), which is far and away the most popular variety. Close variants of it are available in Perl, PHP, Ruby, ECMAScript/JavaScript, C/C++, and practically every other programming language known to man, although support for individual features varies from engine to engine.
What you will know
In this tutorial I’ll introduce some advanced regex concepts that will allow you to parse text like a pro: lazy quantifiers, non-capturing parentheses, pattern modifiers, lookaround, and more. This is all good stuff, so let’s get started.
Greedy vs. Lazy Quantifiers
The standard quantifiers (?, *, +, and {min,max}) are greedy. Every quantifier has a minimum number of repetitions that must match before it’s considered successful, as well as a maximum number that it can match. What’s important to know is that the standard quantifiers will always match as many times as they can while still allowing the rest of the expression to match.
Overlooking this rule is an exceedingly common trap that many novices fall into. Suppose I want to match emphasized text in an HTML document, for example. I could use a regular expression like <em>.*</em>, right? This expression would work fine for a string like “Mike’s website has <em>billions</em> of visitors” (I wish). But if the text has multiple em tags this regex won’t work. It will match from the beginning of the first <em> through to the last closing </em>, which isn’t at all what we want.
The lazy quantifiers are identical to the standard quantifiers, but have a trailing ‘?’ (question mark) appended to them. The lazy version of ‘*’, for example, is ‘*?’. Unlike their greedy counterparts, lazy quantifiers will match as few times as possible. This can be extremely useful, and can sometimes turn an exceedingly complicated regular expression into a simple one-liner. The regular expression <em>.*?</em> will only match up to the first closing </em> tag, which does just what we want (assuming there are no nested em tags).
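Here’s a minimal sketch of the difference, using Python’s re module as a stand-in (that choice is my assumption; the quantifier syntax is the same in Perl). The sample sentence and variable names are made up for illustration.

```python
import re

# Hypothetical sample text with two separate <em> elements.
text = "Mike's <em>website</em> has <em>billions</em> of visitors"

# Greedy: .* grabs as much as it can, so a single match spans both tags.
greedy = re.findall(r"<em>.*</em>", text)

# Lazy: .*? stops at the first closing tag, giving one match per element.
lazy = re.findall(r"<em>.*?</em>", text)

print(greedy)  # ['<em>website</em> has <em>billions</em>']
print(lazy)    # ['<em>website</em>', '<em>billions</em>']
```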
Non-Capturing Parentheses
By default, parentheses capture whatever text they match, storing it for later use. There are times when you’ll want to group a subexpression together without capturing. To do so you can use the special notation (?: ). These non-capturing parentheses behave exactly like normal parentheses, but do not capture their contents. Note that the use of the ‘?’ (question mark) character has nothing to do with the “optional” ‘?’ metacharacter.
Using non-capturing parentheses is good practice for two reasons. First, it will make the regex slightly more efficient. More importantly, it can make a regex easier to understand. A complicated regex may require several grouped subexpressions to extract a single piece of text. By using non-capturing parentheses it’s immediately obvious which parts of the text you’re interested in extracting, and which parts merely provide context for your match.
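As a rough illustration (again in Python’s re module, which is my choice and not something specific to this tutorial), compare what each version hands back. The URL and the patterns are hypothetical examples.

```python
import re

url = "https://www.example.com/index.html"

# Capturing parentheses around the scheme: two groups come back.
capturing = re.match(r"(https?|ftp)://([^/]+)", url)

# (?: ) still groups the alternation, but only the host is captured.
non_capturing = re.match(r"(?:https?|ftp)://([^/]+)", url)

print(capturing.groups())      # ('https', 'www.example.com')
print(non_capturing.groups())  # ('www.example.com',)
```

With the non-capturing version, the only group is the piece of text I actually care about.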
Pattern Modifiers
Perl compatible regular expressions allow pattern modifiers (also called regex modifiers or just modifiers) to be placed after the closing delimiter of an expression. The modifiers affect how the expression is compiled and interpreted by the regex engine. There are four core modifiers that are frequently used and extremely useful. If you need more than one modifier, you can group them together and place them in any order following the closing delimiter of your regex.
The /i modifier
Enables case-insensitive matching, and is probably the most frequently used modifier. If I wanted to match the word “mike,” regardless of capitalization, I could use the regex /mike/i (note that I’ve included the regex delimiters in this example, which I haven’t done anywhere else, in order to demonstrate how to use pattern modifiers). There are a number of Unicode-related issues with case-insensitive matching (or loose matching, as Unicode calls it). If you’re matching Unicode text, it’s best to avoid this mode unless you know what you’re doing.
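A small sketch of case-insensitive matching, assuming Python’s re module, where the flag re.IGNORECASE plays the role of /i (Python passes modifiers as flags, not after a delimiter). The word list is made up, and the last check is one example of the kind of Unicode surprise mentioned above.

```python
import re

# re.IGNORECASE is Python's counterpart to the /i modifier.
pattern = re.compile(r"mike", re.IGNORECASE)

names = ["mike", "Mike", "MIKE", "mikey", "Malone"]
matches = [n for n in names if pattern.search(n)]
print(matches)  # ['mike', 'Mike', 'MIKE', 'mikey']

# Unicode caveat: the German sharp s uppercases to "SS", but the engine
# compares one character at a time, so this search finds nothing.
sharp_s = re.search(r"straße", "STRASSE", re.IGNORECASE)
print(sharp_s)  # None
```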
The /x modifier
Enables extended mode, which allows you to format complicated expressions so that they are more readable and maintainable. In extended mode, whitespace outside of character classes is ignored (or treated as a no-op metacharacter), and comments are allowed between # and a newline.
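Here’s what extended mode can look like in practice. This sketch assumes Python’s re module, where re.VERBOSE is the equivalent of /x; the date pattern and sample sentence are hypothetical.

```python
import re

# re.VERBOSE is Python's counterpart to /x: whitespace in the pattern is
# ignored and "#" starts a comment that runs to the end of the line.
date = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.VERBOSE)

m = date.search("Posted on 2007-06-21 by Mike")
print(m.groups())  # ('2007', '06', '21')
```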
The /s modifier
Changes the behavior of the ‘.’ (dot) metacharacter to match all characters, including newlines. The ‘.’ metacharacter doesn’t match newlines by default, for mostly historical reasons. The original regex tools were Unix command line utilities that operated as filters on a line-by-line basis. Thus, matching a newline wasn’t even an issue. Which mode is most appropriate will depend on what you’re trying to match, so this modifier comes in handy on a pretty regular basis.
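A quick sketch of the difference, assuming Python’s re module, where re.DOTALL corresponds to /s. The two-line string is a made-up example.

```python
import re

# Hypothetical element whose contents span two lines.
html = "<em>spans\ntwo lines</em>"

# By default the dot refuses to cross the newline, so nothing matches.
default = re.search(r"<em>.*?</em>", html)

# With re.DOTALL the dot matches the newline too.
dotall = re.search(r"<em>.*?</em>", html, re.DOTALL)

print(default)               # None
print(dotall.group(0) == html)  # True
```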
The /m modifier
Enables multiline mode, modifying where the line anchors (‘^’ and ‘$’) match. By default, the line anchors do not match before and after embedded newlines. Instead they match only at the beginning and end of the entire subject string. Given the subject string “Mike\nMalone,” by default, the ‘^’ character will match only at the beginning of the string (before the ‘M’), and the ‘$’ character will match only at the end (after the last ‘e’). With multiline mode enabled, however, the ‘^’ character will match at the beginning of the string, and following the newline character. Similarly, the ‘$’ character will match at the end of the string, and preceding the newline character.
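The “Mike\nMalone” example can be tried out directly. This sketch assumes Python’s re module, where re.MULTILINE corresponds to /m; the variable names are only for illustration.

```python
import re

subject = "Mike\nMalone"

# Default mode: ^ and $ only anchor at the ends of the whole string.
start_default = re.findall(r"^\w+", subject)
end_default = re.findall(r"\w+$", subject)

# Multiline mode: ^ and $ also anchor around the embedded newline.
start_multi = re.findall(r"^\w+", subject, re.MULTILINE)
end_multi = re.findall(r"\w+$", subject, re.MULTILINE)

print(start_default, end_default)  # ['Mike'] ['Malone']
print(start_multi, end_multi)      # ['Mike', 'Malone'] ['Mike', 'Malone']
```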
Character and Class Shorthands
There are a number of character shorthands that allow you to match control characters that would otherwise be difficult to represent. They are, for the most part, the familiar set of escaped characters that have been around since the C programming language was developed: ‘\n’ for a newline, ‘\r’ for a carriage return, ‘\t’ for a tab, etc.
There is also a series of class shorthands that represent common character classes and are frequently used to simplify expressions. A short list of these should suffice; they’re mostly self-explanatory.
- \d matches any decimal digit
- \D matches any character that is not a decimal digit (it’s the same as [^\d])
- \s matches any whitespace character
- \S matches any non-whitespace character
- \w matches any “word” character (usually the same as [a-zA-Z0-9_])
- \W matches any “non-word” character
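To see the shorthands at work, here’s a small sketch in Python’s re module (my choice of language; the shorthands themselves are written the same way in Perl). The sample string is made up, and it sticks to plain ASCII because Python’s \d and \w also accept non-ASCII digits and letters by default.

```python
import re

# Hypothetical ASCII-only sample text.
sample = "Order 66: 3 widgets_x, 12.50 USD"

digits = re.findall(r"\d+", sample)        # runs of decimal digits
words = re.findall(r"\w+", sample)         # runs of "word" characters
chunks = re.findall(r"\S+", sample)        # runs of non-whitespace
digits_only = re.sub(r"\D", "", sample)    # delete every non-digit

print(digits)       # ['66', '3', '12', '50']
print(words)        # ['Order', '66', '3', 'widgets_x', '12', '50', 'USD']
print(chunks)       # ['Order', '66:', '3', 'widgets_x,', '12.50', 'USD']
print(digits_only)  # 6631250
```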
Positive & Negative Lookahead / Lookbehind
Lookaround is the general term used for a group of constructs that allow you to ensure that a given expression exists in the text you’re matching, without actually matching anything. With positive lookahead, for example, you can write a regular expression that will match the word “mike,” but only if it’s immediately followed by the word “rocks.”
Positive lookahead is specified using the sequence (?= ). To match “mike” in the text “mike rocks,” for example, I would write mike(?= rocks). Another type of lookaround is lookbehind, which looks backwards. It is specified using the sequence (?<= ). Thinking of the ‘<=’ sequence as an arrow pointing backwards might help you remember this construct.
Negative lookahead and lookbehind work the same way, but are successful only when their subexpression does not match. Negative lookahead is specified using the (?! ) construct, and negative lookbehind is specified using (?<! ).
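All four constructs can be sketched with one sentence. This example assumes Python’s re module, which only allows fixed-width lookbehind; the sentence and variable names are made up. Each substitution uppercases only the “mike” that the lookaround lets through, and the surrounding text is left alone because lookaround never consumes anything.

```python
import re

text = "mike rocks but mike sleeps"

# Positive lookahead: "mike" only when " rocks" follows.
ahead = re.sub(r"mike(?= rocks)", "MIKE", text)

# Negative lookahead: "mike" only when " rocks" does not follow.
not_ahead = re.sub(r"mike(?! rocks)", "MIKE", text)

# Positive lookbehind: "mike" only when "but " comes right before it.
behind = re.sub(r"(?<=but )mike", "MIKE", text)

# Negative lookbehind: "mike" only when "but " does not come before it.
not_behind = re.sub(r"(?<!but )mike", "MIKE", text)

print(ahead)       # MIKE rocks but mike sleeps
print(not_ahead)   # mike rocks but MIKE sleeps
print(behind)      # mike rocks but MIKE sleeps
print(not_behind)  # MIKE rocks but mike sleeps
```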
The most confusing thing about lookaround is understanding why it’s useful, so let’s look at an example that I’ve borrowed from Jeffrey Friedl’s excellent book Mastering Regular Expressions. If you want to display a large number (like 8927369280) in printed text, it’s often helpful to insert commas between each group of three digits. The rule here is that we want to insert a comma at locations having digits on the right in exact sets of three, and at least one digit on the left. This can be accomplished fairly easily using lookaround.
We can fulfill the second requirement using lookbehind. The simple subexpression (?<=\d) will match locations that have at least one digit to their left. For the first requirement we need to match sets of three digits up to the end of the string. The simple expression (\d\d\d)+$ accomplishes this task, and if we wrap it in a lookahead construct it will match at locations that are a whole number of three-digit groups from the end of the string. So the completed regular expression
(?<=\d)(?=(\d\d\d)+$)
will match each location where a comma should be inserted. We can test out this regex using a simple Perl script on the command line:
$ perl -e '$num = 8927369280;'\
> '$num =~ s/(?<=\d)(?=(\d\d\d)+$)/,/g;'\
> 'print $num, "\n"'
8,927,369,280
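If you’d rather poke at the same regex from a script, here’s a rough Python equivalent (the language and the helper name add_commas are my own assumptions, not part of Friedl’s example). It also shows that short numbers are left alone.

```python
import re

def add_commas(number):
    """Hypothetical helper: insert commas using the lookaround regex above."""
    # The pattern matches empty positions only, so sub() inserts a comma
    # at each one without removing any digits.
    return re.sub(r"(?<=\d)(?=(\d\d\d)+$)", ",", str(number))

print(add_commas(8927369280))  # 8,927,369,280
print(add_commas(1234))        # 1,234
print(add_commas(123))         # 123
```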
Where to from here?
One final tip: whenever you write a particularly interesting or elegant regular expression, or whenever you find one that someone else has written, keep track of it somewhere (just as you would with a particularly elegant piece of code). Regular expressions are highly reusable and highly portable. And the tasks they accomplish are common to many applications (e.g., input validation, filtering, data-scraping, etc). If you use a regex once, you’ll probably find it useful again in the future.
If you’ve gotten this far, you’ve got a pretty comprehensive understanding of how regular expressions work. With some practice you’ll be a regex guru in no time. If you’re interested in learning more, however, I highly recommend Jeffrey Friedl’s book Mastering Regular Expressions. It’s an excellent read.
June 22nd, 2007 at 12:25 am
This will help a lot. RegEx is way over me. It’s always a struggle every time I need to use it for something :/ Maybe one day I’ll learn it :p
Thank you
June 22nd, 2007 at 7:01 am
Good post, Mike. Well written, and you explain things very simply. However, there are a couple relatively minor issues that can be somewhat misleading here, e.g., your description of “PCRE” (which is in fact a specific flavor, and from the programming languages you mentioned, it’s only actually used by PHP), and describing \b as a metasequence for a backspace (which is only true within character classes… elsewhere it matches a word boundary). Also, I have trouble with the idea of describing regexes as “highly portable.” For one, syntax and even fundamental engine mechanics differ between libraries, and secondly, if users don’t completely understand a pattern there is potential to be burned down the line if it’s used with (perhaps subtle) differences in data that was unaccounted for (I’ve seen more than one developer crash a server by using a regex which triggered catastrophic backtracking).
June 22nd, 2007 at 7:27 am
I like Scott Hanselman’s quote: You have a problem so you decide to solve it using regular expressions. Now you have two problems.
=)
June 22nd, 2007 at 7:34 am
The command-line example of comma insertion is missing a final + at the end of the (\d\d\d) group.
June 22nd, 2007 at 9:20 am
@Steve
PCRE stands for Perl Compatible Regular Expressions… it’s a specific flavor, you’re right, but it’s available in almost every programming language, not just PHP (and it’s certainly available in Perl, see the third paragraph of the post).
I removed the \b example, you’re right… that’s confusing. And honestly, how often do you need to match a backspace? Probably best just to avoid the confusion.
Re: portability: if you use the same “flavor” (i.e., PCRE) then you can, for the most part, use the same regex in different programming languages. What changes are some of the mechanics (e.g., how the regex is invoked, how capturing works, etc). I agree with you that using a regex that you don’t understand can be dangerous, which is why I wrote the tutorial! :)
@Matt
I’ve heard that quote before, but I think the problem is that a lot of programmers simply don’t understand regexes. If you compare a solid regex to the alternative (which often involves writing a complicated state machine to perform some simple parsing operation) the regex is often several orders of magnitude simpler, and more maintainable.
@Greg
Good find, I fixed it. Don’t know how that slipped through, thanks.
June 22nd, 2007 at 11:30 am
Hi Mike,
Liked the tutorial. I am an old Snobol/Spitbol programmer that still uses these languages today to do all my text parsing. (While the languages are old and few of you young guys know about them, Snobol at least, is free and open source.)
I keep looking at Reg Ex’s to see if I am missing anything. So far, no offense, but I don’t think so.
You raise some good points about parsing text and the need for forward and backward looking, as well as less greedy matching.
fwiw, here is the Snobol/Spitbol equivalent of the last Perl example, if anyone is curious:
d = any('0123456789') ; num = '8927369280' ; rline = reverse(num)
match rline ( d d d ) $ d3 d $ md = d3 ',' md :s(match)
output = reverse(rline)
end
fwiw,
Russ
June 22nd, 2007 at 2:35 pm
Who knew (?!) actually meant something in a regex! ;-)
June 22nd, 2007 at 7:36 pm
@Mike,
I’m guessing that by PCRE you mean a traditional NFA regex engine (which can indeed be used within all the programming languages you mentioned). PCRE, regardless of what it stands for, is a very specific regex library, and is not even fully compatible with Perl (take, as one example, PCRE’s recursion syntax, or Perl’s unique ability to embed code within regexes). If you compare PCRE to other NFAs like those used by JavaScript, the incompatibilities are even greater (e.g., JavaScript doesn’t support lookbehind, atomic groupings, possessive quantifiers, conditionals, Unicode properties, etc.)
June 22nd, 2007 at 7:41 pm
@Matt,
I think you are misattributing that quote. See this detailed post on Jeffrey Friedl’s blog: http://regex.info/blog/2006-09-15/247?nc3
June 30th, 2007 at 2:51 am
Thanks Mike you have excelled yourself, there are heaps of tutorials out there on regular expressions, but unfortunately 99% of them ramble and are hard to follow.
July 16th, 2007 at 1:07 am
Hi, Thanks for article; The below expression : ]*\n?.*=(”"|’)?(.*\.jpg)(”"|’)?.*\n?[^
It returns path of all .jpg Image files in an Html document; But if the IMG tag is finishing at the next line then it does not return that image’s path ; For example,
Above only returns =”C:/ImageTemplates/close.jpg” ;
But when the content is:
Then :
1) =”C:/ImageTemplates/close.jpg”
2) ==”C:/ImageTemplates/blast.gif” are returned
What can I do to modify the expression?
Any help will be appreciated.. Thanks.
July 21st, 2007 at 12:37 pm
Thanks for this article!
Vasudev Ram
http://www.dancingbison.com
jugad.livejournal.com
sourceforge.net/projects/xtopdf
March 26th, 2008 at 11:37 am
I looked at many web resources for regular expressions and their explanations but could not find a satisfying one. I read all three of your articles and they really are a very good resource. The writing and explanation are really simple and intuitive. I love them all.