The absolute bare minimum every programmer should know about regular expressions
Programming, Tutorials April 6th, 2007 - 71,309 views
Wtf is a regular expression?
Regular expressions are strings formatted using a special pattern notation that allow you to describe and parse text. Many programmers (even some good ones) dismiss regular expressions as line noise, and it’s a damned shame because they come in handy so often. Once you’ve gotten the hang of them, regular expressions can be used to solve countless real-world problems.
Regular expressions work a lot like the filename “globs” in Windows or *nix that allow you to specify a number of files by using the special * or ? characters (oops, did I just use a glob while defining a glob?). But with regular expressions the special characters, or metacharacters, are far more expressive.
Like globs, regular expressions treat most characters as literal text. The regular expression mike, for example, will only match the letters m - i - k - e, in that order. But regular expressions sport an extensive set of metacharacters that make the simple glob look downright primitive.
Meet the metacharacters: ^[](){}.*?\|+$ and sometimes -
I know, they look intimidating, but they’re really nice characters once you get to know them.
The Line Anchors: ‘^’ and ‘$’
The ‘^’ (caret) and ‘$’ (dollar) metacharacters represent the start and end of a line of text, respectively. As I mentioned earlier, the regular expression mike will match the letters m - i - k - e, but it will match anywhere in a line (e.g. it would match “I’m mike”, or even “carmike”). The ‘^’ character is used to anchor the match to the start of the line, so ^mike would only find lines that start with mike. Similarly, the expression mike$ would only find m - i - k - e at the end of a line (but would still match ‘carmike’).
If we combine the two line anchor characters we can search for lines of text that consist entirely of a specific sequence of characters. The expression ^mike$ will only match the word mike on a line by itself - nothing more, nothing less. Similarly the expression ^$ is useful for finding empty lines, where the beginning of the line is promptly followed by the end.
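Here’s a minimal sketch of the anchors in action. I’m assuming Python and its standard re module purely for illustration, and the sample lines and the matching() helper are made up for this example.

```python
import re

# Hypothetical sample lines, invented for this illustration.
lines = ["mike", "I'm mike", "carmike", "mike was here", ""]

def matching(pattern):
    """Return the sample lines in which the pattern is found."""
    return [line for line in lines if re.search(pattern, line)]

unanchored = matching(r"mike")    # found anywhere in the line
at_start = matching(r"^mike")     # line must start with mike
at_end = matching(r"mike$")       # line must end with mike
whole_line = matching(r"^mike$")  # line must be exactly mike
empty = matching(r"^$")           # empty lines only

print(unanchored, at_start, at_end, whole_line, empty)
```

Note that re.search looks for the pattern anywhere in the string, which is the behavior described above; the anchors are what pin the match to one end of the line or the other.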
The Character Class: ‘[]’
Square brackets, called a character class, let you match any one of several characters. Suppose you want to match the word ‘gray’, but also want to find it if it was spelled ‘grey’. A character class will allow you to match either. The regular expression gr[ea]y is interpreted as “g, followed by r, followed by either an e or an a, followed by y.”
If you use [^ ... ] instead of [ ... ], the class matches any character that isn’t listed. The leading ^ “negates” the list. Instead of listing all of the characters you want to include in the class, you list the characters you don’t want included. Note that the ^ (caret) character used here has a different meaning when it’s used outside of a character class, where it is used to match the beginning of a line.
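A quick sketch of a character class and its negated twin, again assuming Python’s re module. The word list is invented, and I’ve added line anchors so each word has to match in full.

```python
import re

# Hypothetical word list for illustration.
words = ["gray", "grey", "groy", "gry"]

# gr[ea]y: g, r, then either e or a, then y.
either = [w for w in words if re.search(r"^gr[ea]y$", w)]

# gr[^ea]y: g, r, then any single character that is NOT e or a, then y.
neither = [w for w in words if re.search(r"^gr[^ea]y$", w)]

print(either, neither)
```

A negated class still has to match one character, which is why ‘gry’ shows up in neither list.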
The Character Class Metacharacter: ‘-’
Within a character class, the character-class metacharacter ‘-’ (dash) indicates a range of characters. Instead of [0123456789abcdefABCDEF] we can write [0-9a-fA-F]. How convenient. The dash is a metacharacter only within a character class; elsewhere it simply matches the normal dash character.
But wait, there’s a catch. If a dash is the first character in a character class it is not considered a metacharacter (it can’t possibly represent a range, since a range requires a beginning and an end), and will match a normal dash character. Similarly, the question mark and period are usually regex metacharacters, but not when they’re inside a class (in the class [-0-9.?] the only special character is the dash between the 0 and 9).
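To see ranges and the leading-dash rule at work, here is a small sketch assuming Python’s re module. The test strings are made up, and the + quantifier and line anchors are only there so the hex check covers the whole string.

```python
import re

# [0-9a-fA-F] is a range-based class for a single hexadecimal digit.
hex_string = re.compile(r"^[0-9a-fA-F]+$")
is_hex = hex_string.search("DEADbeef42") is not None
not_hex = hex_string.search("xyz") is None

# In [-0-9.?] the leading dash, the period and the question mark are
# all literal; only the dash between 0 and 9 forms a range.
found = re.findall(r"[-0-9.?]", "a-b 42. ok?")

print(is_hex, not_hex, found)
```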
Matching Any Character With a Dot: ‘.’
The ‘.’ metacharacter (called a dot or point) is shorthand for a character class that matches any character (in most flavors, any character except a newline unless you ask otherwise). It’s very convenient when you want to match any character at a particular position in a string. Once again, the dot metacharacter is not a metacharacter when it’s inside of a character class. Are you beginning to see a pattern? The list of metacharacters is different inside and outside of a character class.
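A short sketch of the dot, assuming Python’s re module and some invented sample text. The newline check reflects Python’s default behavior; other flavors and options may differ.

```python
import re

# Outside a class, the dot matches any one character in that position.
any_middle = re.findall(r"gr.y", "gray grey gr8y gry")

# Inside a class, the dot is just a literal period.
dot_in_class = re.findall(r"gr[.]y", "gr.y gray")

# By default Python's dot does not match a newline.
crosses_newline = re.search(r"a.b", "a\nb") is not None

print(any_middle, dot_in_class, crosses_newline)
```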
The Alternation Metacharacter: ‘|’
The ‘|’ metacharacter (pipe) means “or.” It allows you to combine multiple expressions into a single expression that matches any of the individual ones. The subexpressions are called alternatives.
For example, Mike and Michael are separate expressions, but Mike|Michael is one expression that matches either.
Parentheses can be used to limit the scope of the alternatives. I could shorten our previous expression that matched Mike or Michael with creative use of parentheses. The expression Mi(ke|chael) matches the same thing. I probably would use the first expression in practice, however, as it is more legible and therefore more maintainable.
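Here’s a sketch comparing the two spellings of the expression, assuming Python’s re module. The list of names is invented, and I use re.fullmatch so that each name has to match in its entirety.

```python
import re

# Hypothetical names for illustration.
names = ["Mike", "Michael", "Mick", "Mi"]

flat = [n for n in names if re.fullmatch(r"Mike|Michael", n)]
grouped = [n for n in names if re.fullmatch(r"Mi(ke|chael)", n)]

# Without parentheses the alternatives are the whole of each side of
# the pipe, so Mike|chael means "Mike" or "chael", not Mi(ke|chael).
unscoped = re.fullmatch(r"Mike|chael", "chael") is not None

print(flat, grouped, unscoped)
```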
Matching Optional Items: ‘?’
The ‘?’ metacharacter (question mark) means optional. It is placed after a character that is allowed, but not required, at a certain point in an expression. The question mark attaches only to the immediately preceding item, which in the simplest case is a single character.
If I wanted to match the British or American versions of the word ‘flavor’ I could use the regex flavou?r, which is interpreted as “f, followed by l, followed by a, followed by v, followed by o, followed by an optional u, followed by r.”
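A tiny sketch of the optional u, assuming Python’s re module. The sample text, including the deliberately misspelled ‘flavouur’, is made up.

```python
import re

# The u may appear zero times or once, but not twice.
spellings = re.findall(r"flavou?r", "flavor flavour flavouur")

print(spellings)
```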
The Other Quantifiers: ‘+’ and ‘*’
Like the question mark, the ‘+’ (plus) and ‘*’ (star) metacharacters affect the number of times the preceding character can appear in the expression (with ‘?’ the preceding character could appear 0 or 1 times). The metacharacter ‘+’ matches one or more of the immediately preceding item, while ‘*’ matches any number of the preceding item, including 0.
If I were trying to determine the score of a soccer match by counting the number of times the announcer said the word ‘goal’ in a transcript, I might use the regular expression go+al, which would match ‘goal’, as well as ‘gooooooooooooooooal’ (but not ‘gal’).
The three metacharacters, question mark, plus, and star are called quantifiers because they influence the quantity of the item they’re attached to.
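The soccer example could be sketched like this, assuming Python’s re module and an invented transcript.

```python
import re

# Hypothetical announcer transcript.
transcript = "goal! gooooal! gal? goal"

# go+al needs at least one o, so 'gal' is skipped.
goals = re.findall(r"go+al", transcript)
score = len(goals)

# go*al allows zero o's, so 'gal' sneaks in.
with_star = re.findall(r"go*al", transcript)

print(score, goals, with_star)
```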
The Interval Quantifier: ‘{}’
The ‘{min,max}’ metasequence allows you to specify the number of times a particular item can match by providing your own minimum and maximum. The regex go{1,5}al would limit our previous example to matching between one and five o’s. The sequence {0,1} is identical to a question mark. (Leave out the space after the comma; in many flavors a space inside the braces stops them from being treated as a quantifier.)
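A sketch of the interval quantifier, assuming Python’s re module. The candidate strings are generated rather than typed out so the number of o’s is easy to check.

```python
import re

# 'gal', 'goal', 'gooal', ... with zero through seven o's.
candidates = ["g" + "o" * n + "al" for n in range(8)]

# Keep only the candidates with between one and five o's.
limited = [c for c in candidates if re.fullmatch(r"go{1,5}al", c)]
o_counts = [c.count("o") for c in limited]

# {0,1} behaves exactly like the question mark.
words = ["flavor", "flavour", "flavouur"]
with_interval = [w for w in words if re.fullmatch(r"flavou{0,1}r", w)]
with_question = [w for w in words if re.fullmatch(r"flavou?r", w)]

print(o_counts, with_interval, with_question)
```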
The Escape Character: ‘\’
The ‘\’ metacharacter (backslash) is used to escape metacharacters that have special meaning so you can match them in patterns. For example, if you would like to match the ‘?’ or ‘\’ characters, you can precede them with a backslash, which removes their meaning: ‘\?’ or ‘\\’.
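Here is a sketch of escaping, assuming Python’s re module. The sample strings are invented. I write the patterns as raw strings (the r prefix) so Python itself doesn’t interpret the backslashes before the regex engine sees them, and re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

# Escaped metacharacters match themselves literally.
question_marks = re.findall(r"\?", "what? why?")
backslashes = re.findall(r"\\", r"C:\temp\file")

# Without the backslash, ? is a quantifier again: why? means
# "wh followed by an optional y".
optional_y = re.fullmatch(r"why?", "wh") is not None

# re.escape() escapes every metacharacter in a literal string.
literal = re.escape("a.b?")
escaped_ok = re.fullmatch(literal, "a.b?") is not None
escaped_strict = re.fullmatch(literal, "axb") is None

print(question_marks, backslashes, optional_y, escaped_ok, escaped_strict)
```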
When used before a non-metacharacter, a backslash can have a different meaning depending on the flavor of regular expression you’re using. For Perl-style regular expressions you can check out the perldoc page for perl regular expressions (perlre). This flavor is extremely common: PHP uses the PCRE (Perl Compatible Regular Expressions) library, and Ruby, ECMAScript/JavaScript, and many other languages support a very similar syntax.
Using Parentheses for Matching: ‘()’
Most regular expression tools will allow you to capture a particular subset of an expression with parentheses. I could match the domain portion of a URL by using an expression like http://([^/]+). Let’s break this expression down into its components to see how it works.
The beginning of the expression is fairly straightforward: it matches the sequence “h - t - t - p - : - / - /”. This initial sequence is followed by parentheses, which are used to capture the characters that match the subexpression they surround. In this case the subexpression is ‘[^/]+’, which matches any character except for ‘/’ one or more times. For a URL like http://immike.net/blog/Some-blog-post, ‘immike.net’ will be captured by the parentheses.
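To show how the captured text might actually be used, here is a minimal sketch assuming Python’s re module. The domain_of() helper and the ftp://example.org string are hypothetical, added only for illustration.

```python
import re

def domain_of(url):
    """Return the domain captured from an http:// URL, or None."""
    match = re.search(r"http://([^/]+)", url)
    # group(1) is the text captured by the first pair of parentheses.
    return match.group(1) if match else None

domain = domain_of("http://immike.net/blog/Some-blog-post")
no_match = domain_of("ftp://example.org/file")

print(domain, no_match)
```

Checking for None before calling group() matters, because re.search returns None when the pattern isn’t found at all.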
Want to know more?
I’ve only scratched the surface of what can be done with regular expressions. If you want to learn more, check out Jeffrey Friedl’s book Mastering Regular Expressions. Friedl did an outstanding job with this book; it’s accessible and a surprisingly fun and interesting read, considering the dry subject matter.
Check out my follow up to this article where I take a look at some of the most useful regular expressions for common programming tasks. And once you understand the basics read on to learn all you need to know to become a regex pro.
May 5th, 2007 at 11:18 am
[…] Malone put up an excellent introduction to Regular Expressions that I think anyone that is rusty or curious should take a look at. He took the most basic concepts […]
May 5th, 2007 at 11:22 am
Mike, thank you for writing this up. It’s an excellent resource, especially for folks that dabble in regexp every few years and always have to re-learn it right before using it again. This is going into my programming bookmarks.
May 5th, 2007 at 2:40 pm
Excellent write-up, the only things I would add would be:
* a tad more info on captures (say, a simple example of how the capture could be used)
* the concept of “greedy” (possibly a few other modifiers such as caseless, multiline)
Those may be too advanced for what you’re aiming for here, but regardless, nice article!
May 5th, 2007 at 2:57 pm
@Riyad: Thanks, I’m glad you found it useful.
@kenman
I debated what was worth including and what should be left out. I briefly touch on greedy vs. lazy quantifiers and capturing in my follow-up article, but more could definitely be said.
Some other important concepts are missing as well, such as positive & negative lookahead / lookbehind, non-capturing parenthesis, some discussion of character sets (ensuring that your regex is utf8 safe), assertions (other than beginning/end of line), pattern modifiers, and probably more. I might have to write a follow-up on the more advanced regex concepts at some point.
May 5th, 2007 at 9:48 pm
I would definitely be interested in a good article on those ‘other important concepts’ you mention - most of those you mentioned are still new to me, or are completely beyond my learning thus far. I only recently discovered named captures i.e. /(?P[A-Za-z]+)/, and have loved it for its self-documenting properties. Here’s one vote for an advanced article :) (p.s. your follow-up is good stuff as well!)
May 5th, 2007 at 9:49 pm
Comment formatting ate my example above :\
May 6th, 2007 at 8:14 am
Great introduction to regexps!
May 6th, 2007 at 10:03 am
Nice article.
Every UNIX system administrator should be well versed in regular expressions. I use them all the time.
May 6th, 2007 at 11:33 am
Thanks for the excellent resource
May 6th, 2007 at 2:36 pm
This is the most straightforward tutorial on regular expressions I’ve found. And the included examples help out a lot.
Thanks!
May 6th, 2007 at 10:27 pm
Lovely write-up. If only this was around a year ago when I started to learn regexp.
May 7th, 2007 at 2:35 am
“Some people, when confronted with a problem, think “I know, I’ll use regular expressions!” Now they have two problems.” –Jamie Zawinski
Seriously, regular expressions are useful, but not pretty. You speak at one point about one regexp being more maintainable than another. But if you find yourself in a position where you actually need to do maintenance on your regexps, I wonder if you haven’t fallen into Jamie’s trap.
May 7th, 2007 at 4:41 am
I’ve found the article very useful and I would like to make a Spanish translation. Of course I would acknowledge you as the author and would link to the original post, but first I want to ask your permission.
May 7th, 2007 at 6:37 am
I have seen a little misprint: where it says “so ^mike would only find words that start with mike”, I think it should say “lines” instead of “words”.
May 7th, 2007 at 9:38 am
@Jose
You can absolutely do a Spanish translation. That would be really cool. Let me know what the URL is when you’re done and I’ll link to it.
I’ve fixed the misprint you found, thanks for the heads up.
@Harald
I guess some people have problems understanding regular expressions, which makes them difficult to use and maintain. My goal with this post was to show that they really aren’t that difficult to understand once you get the hang of them. I’ve gotten to the point where I can look at a pretty complicated regular expression and read it almost like a sentence.
As with any choice (programming related or not) there are pros and cons to using regular expressions, but I think they are an incredibly useful tool that every programmer should have at their disposal.
May 7th, 2007 at 10:02 am
[…] This post is a translation of an excellent article by Mike Malone, “The Absolute Bare Minimum Every Programmer Should Know About Regular Expressions“ […]
May 7th, 2007 at 1:54 pm
It might be useful to add the {n,} form of interval quantifier.
That aside, great article.
June 23rd, 2007 at 9:08 am
Is regexp different from PCRE? Does regexp differ for Unix and Windows?
June 24th, 2007 at 7:17 pm
@Nitin: The PCRE library is one implementation of regular expressions. There should be no difference between unix and windows if the same regex library is being used.
July 23rd, 2007 at 3:24 am
Great Primer for regexp.
August 10th, 2007 at 6:55 pm
Good write-up. Definitely bookmark-worthy material, perfect for those quick “i forgot the syntax” moments we all have sometimes. Good job.
September 18th, 2007 at 4:53 pm
Not even by using the most imaginative meaning of the word “programmer” could I find a place for myself under that category. But I do what I can and keep trying. Now that my skill set is laid bare, I actually understood what you said. Can’t say how it will help me but just understanding some of this helps. You, sir, have a gift, because it is dry material and it was an easy read regardless. Thanks for restoring some confidence! Keep explaining please.
September 21st, 2007 at 7:35 am
I’m in need of a script which would transform an XML file to a pipe-delimited flat file…
say
input ex.xml
One
two
required output is
tag1|tag2
one|two
can you please provide a script that helps me…
thanks….
March 2nd, 2008 at 10:59 am
[…] text is my own translation of the original text written by Mike Malone. If [you think] there is an error in the translation […]
March 2nd, 2008 at 11:01 am
Hi Mike, I’ve translated this article in Turkish [0]. Best regards, Ali Servet Dönmez.
[0]: http://www.pittle.org/weblog/her-programcinin-duzenli-deyimler-hakkinda-kesinlikle-en-azindan-bilmesi-gerekenler_228.html
April 14th, 2008 at 4:26 am
Nice write up, this’ll help a lot.
Thanks!