Fork PHP! (and speed up your scripts)

Linux, Programming, Tutorials 10 Comments »

I just came across a forum post discussing what the author calls “multithreaded PHP,” and thought I’d clear a few things up about concurrency in PHP. First of all, this is not multhreading. It’s not even “sort of multithreading,” as the author implies (no offense, it’s just not). What the author is actually doing is creating multiple processes. A process is not the same thing as a thread. In fact, PHP does not support multiple threads at all, and doesn’t plan to do so anytime soon.

Threads and processes both increase the concurrency of an application, meaning you can do more things at the same time. A good example of a program that would benefit from multithreading/multiple processes is a web server. A single process or thread can only handle one I/O operation at a time (that’s not completely accurate, but it simplifies things so whatever… you can read more here if you want to know the truth). I/O operations are s - l - o - w (in computer-terms, anyways). If we fork a second process, or create a new thread, we can handle two I/O operations simultaneously! This simple model is the way 90% of network servers work:

While( 1 )
  Process 1: Listen for connection
  Process 1: Connection made, fork Process 2
End

Process 2: Handle request, terminate.

That way nobody is turned away by our server because the process that is supposed to be listening for incoming connections is handling a request.

Concurrency can come in handy for client-side programs too from time to time. Take the PHP script I wrote to download YouTube videos, for example. The script connects to YouTube, finds the URLs of the relevant FLV files, and pushes each URL onto a stack. Then it pops each URL off the stack one by one and downloads the FLV file. But YouTube throttles the downloads to about 60K/s. That’s fine when you’re streaming the video in your browser since the bitrate is much lower than that. But it’s not cool when you’re on a box with a 100mbit connection! So what can we do about it?

Well, we can’t change the fact that downloads from YouTube are capped at 60K/s, but we can download more than one movie at once by forking additional processes. And it turns out there is an easy way to do it built right in to PHP. The solution is the pcntl_fork() function, which provides an interface to the underlying fork syscall (only available on *nix platforms, sorry Windows users).

The trick to the fork syscall is that it returns twice. Once in the parent, and once in the child. It’s confusing at first, but if you think about it, it makes sense. In the parent the return value is the new child’s process ID. In the child it’s zero. Thus, we can use a simple if statement to determine whether we’re in the parent or the child process, which will usually determine the execution path we follow.

The changes that I had to make to the youtube script were almost trivial. The general pattern looks like this:

$pid = pcntl_fork();

if($pid == -1) {
  die('could not fork')
}
else if ($pid) {
  // positive value means we're in the parent.
  // do whatever parents do
  ....
  // wait for children to complete by calling
  // pcntl_wait() or a variant
} else {
  // zero value means we're in the child.
  // do whatever children do
  // (e.g. download the files, then exit)
}

Check out the source code here (plaintext version), and compare it to the previous version (plaintext version). The differences are minimal, but the forked version can substantially increase the execution speed if you’re downloading several files on a fast connection.

YouTube Video Ripper in PHP

Hacks, Programming 24 Comments »

I found an interesting shell script today while browsing digg that allows you to download all the youtube videos for a particular user that match a certain pattern. This is a great example of the power of regular expressions, by the way… I wrote a quick PHP port (plaintext version) that you can use if you’re running Windows. It doesn’t have a progress bar (since it’s not using wget), but it gets the job done.

Update: if you don’t have the PHP interpreter installed either, you can go to this page to generate direct links to the videos that would have been downloaded by the script automatically. I just threw it together in 5 minutes, so it’s not pretty. But again, it works. Just right click and “save as.”

read more

Hacking Google Spell Checker for Fun and Profit

Google, Hacks, Programming 35 Comments »

Try it out!

 


A few days ago I was researching ways to integrate spell checking with the search engine for a project I’m working on similar to the way Google does. I figured Google, being Google, must have some legitimate mechanism for accessing their spell checker (this is Web 2.0, after all).

After scouring the Internet for some time all I could find was a deprecated SOAP web service that used to be available as part of their SOAP search API. Unfortunately they stopped issuing API keys for the SOAP Search API on December 5, 2006. The ajax search API that replaced it doesn’t seem to provide spelling corrections. Bummer.

Just as I was about to give up I stumbled across an interesting blog post that describes a publicly available (but undocumented and apparently not very widely known) RPC endpoint that Google uses to provide spelling corrections for the Google Toolbar. The URL is https://www.google.com/tbproxy/spell.

Neat. After a few minutes of tinkering I put together a small class in PHP that provides easy access to the service. The class requires SimpleXML and CURL. It defines two static methods, SpellChecker::Check() (which returns true if the query you pass as an argument is spelled correctly) and SpellChecker::Correct() (which returns Google’s suggested spelling). You can download the source here (plaintext version), or try it out with the AJAX spell checker I threw together (up top).

Here’s a quick replay of a typical request/response (I wrapped the XML, but in theory it shouldn’t matter):

POST /tbproxy/spell?lang=en&hl=en HTTP/1.0
MIME-Version: 1.0
Content-type: application/PTI26
Content-length: 125
Content-transfer-encoding: text
Request-number: 1
Document-type: Request
Interface-Version: Test 1.4
Connection: close 

<spellrequest
  textalreadyclipped="0"
  ignoredups="1"
  ignoredigits="1"
  ignoreallcaps="0">
    <text>gogle spel</text>
</spellrequest>

HTTP/1.0 200 OK
Content-Type: text/xml
Server: DocumentSpellcheck
Cache-Control: private, x-gzip-ok=""
Date: Sat, 07 Apr 2007 14:11:57 GMT
Connection: Close

<?xml version="1.0"?>
<spellresult
  error="0"
  clipped="0"
  charschecked="10">
    <c o="0" l="5" s="1">google    Google  goggle  giggle  Gogol</c>
    <c o="6" l="4" s="1">spell       spiel   spelt   spew    Opel</c>
</spellresult>

The suggestions are tab-delineated. The ‘o’ attribute is an offset from the start of your query to the misspelled word. ‘l’ is the length of the misspelled word. ’s’ is the confidence of Google’s suggestion (presumably higher is better, but I’ve only gotten 0 or 1).

5 Regular Expressions Every Web Programmer Should Know

Lists, Programming, Tutorials 55 Comments »

I’m going to assume you have a basic understanding of regular expressions at this point. If you’re a regex n00b (or /n0{2}b/, as I like to call them), or if you need a quick refresher, check out my previous post on the absolute bare minimum that every programmer should know about regular expressions. You won’t be disappointed.

So, without further adu, here are the five regular expressions that I have found the most useful for day-to-day web programming tasks.

Matching a username

This one’s quite easy, but it’s really invaluable if you’re trying to build a user registration system for a website. We typically want to limit usernames to a restricted set of characters in order to make development easier, and to keep malicious users from spoofing someone else’s name (e.g. replacing a space with multiple spaces or a newline character, which are all displayed the same by a web browser).

Without regular expressions, this would be a tedious exercise that would involve splitting the string into it’s component characters and examining each one individually. With regular expressions, it’s a breeze. First, let’s define what we want to accept, we’ll keep it simple and limit the example to the following characters:

  1. Alphanumeric characters (letters and numbers)
  2. The underscore character (_)

We’ll also want to enforce a 3 character minimum and a 16 character maximum length. Here’s the regular expression that matches this fairly standard set of criteria:

/[a-zA-Z0-9_]{3,16}/

If you’re familiar with regular expressions you may have notice something missing at this point - don’t worry, I’ll get to it.

If you’ve read my introductory to regular expressions you should already know how this regex works. First we’re defining a character class that will match any letters (a through z, and A through Z) and any numbers (0 through 9), as well as the _ (underscore) character. Next comes an interval quantifier that tells the regex engine we’ll only match sequences of between 3 and 16 characters. Because the quantifier follows a character class rather than a single character it attaches itself to the entire class, and will match every sequence between 3 and 16 characters so long as each character falls within our restricted character set.

So what’s missing? As it stands our regex will match anywhere within a string. It won’t just match ‘mike_84′, it will also match any ‘%! mike_84&’, which contains several characters we don’t want. What we need are anchors, the ^ (caret) and $ (dollar) characters will anchor our regex to the beginning and end of the string, ensuring that the whole username meets our requirements and not just a portion of it.

So our revised regex will look like this:

/^[a-zA-Z0-9_]{3,16}$/

Here’s a quick PHP code snippet that shows how we can use this regex in production (we could just as easily use perl, java, ruby, or even javascript to do this validation).

function validate_username( $username ) {
  if(preg_match('/^[a-zA-Z0-9_]{3,16}$/', $_GET['username'])) {
    return true;
  }
  return false;
}

Matching an XHTML/XML tag

Matching an XML or XHTML tag can be extremely useful if you’re scraping a website for data, or trying to quickly extract information from an XML document. A simple regex to accomplish this sort of extraction follows this form (the word ‘tag’ should be replaced with whatever tag you are looking for):

{<tag[^>]*>(.*?)</tag>}

The question mark following the star turns the start into a lazy quantifier. By default, quantifiers are greedy, meaning they’ll consume as much of the input text as they can. Lazy quantifiers, by contrast, will match as little of the input text as they can. If we used a greedy quantifier in this case, our regex would not work as advertised on an input document like

<tag>item 1</tag><tag>item 2</tag>

Instead of matching a single tag, a greedy quantifier would match up to the final closing tag in the input text.

Here’s a simple PHP function to extract the contents of each matching XML or XHTML tag as an array:

function get_tag( $tag, $xml ) {
  $tag = preg_quote($tag);
  preg_match_all('{<'.$tag.'[^>]*>(.*?)</'.$tag.'>.'}',
                   $xml,
                   $matches,
                   PREG_PATTERN_ORDER);

  return $matches[1];
}

Matching an XHTML/XML tag with a certain attribute value (e.g. class or tag)

This regex is very similar to the last example, except we only want tags with a certain attribute value. This comes in handy when you want to extract a tag with a particular class or ID value, for example. The regex is just slightly more complicated than our previous example (again, replace tag, attribute, and value with whatever you’re looking for):

{<tag[^>]*attribute\\s*=\\s*(["'])value\\\\1[^>]*>(.*?)</tag>}

We use a character class to allow either single or double quotes around our value. The portion of the regex following the value is called a backreference. It will be replaced with whatever is captured by the first set of parenthesis in the expression (either a single quote or double quote). That way we can be sure that the opening and closing quotes match.

Here’s a PHP function that shows how you can extract information form an XHTML document with this regex. The function tags an attribute, value, input text, and an optional tag name as arguments. If no tag name is specified it will match any tag with the specified attribute and attribute value.

function get_tag( $attr, $value, $xml, $tag=null ) {
  if( is_null($tag) )
    $tag = '\\w+';
  else
    $tag = preg_quote($tag);

  $attr = preg_quote($attr);
  $value = preg_quote($value);

  $tag_regex = "/<(".$tag.")[^>]*$attr\\s*=\\s*".
                "(['\\"])$value\\\\2[^>]*>(.*?)<\\/\\\\1>/"

  preg_match_all($tag_regex,
                 $xml,
                 $matches,
                 PREG_PATTERN_ORDER);

  return $matches[3];
}

Matching and parsing an email address

This one comes courtesy of Cal Henderson, the programmer behind Flickr and author of Building Scalable Web Sites (a great read). For more information check out Cal’s article on parsing email addresses.

This one’s such a behemoth that it’s easier to digest when broken into it’s component parts. Constructing a regex like this is a bit like describing a grammer in Backus-Naur form (BNF), which is convenient because many of the things we’re trying to match are already described using BNF in their specifications. This is the case for email addresses, which are described in RFC 822. Anyways, here’s a PHP function that will check the validity of an e-mail address:

function is_valid_email_address($email){
  $qtext = '[^\\x0d\\x22\\x5c\\x80-\\xff]';
  $dtext = '[^\\x0d\\x5b-\\x5d\\x80-\\xff]';
  $atom = '[^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c'.
                  '\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+';
  $quoted_pair = '\\x5c[\\x00-\\x7f]';
  $domain_literal = "\\x5b($dtext|$quoted_pair)*\\x5d";
  $quoted_string = "\\x22($qtext|$quoted_pair)*\\x22";
  $domain_ref = $atom;
  $sub_domain = "($domain_ref|$domain_literal)";
  $word = "($atom|$quoted_string)";
  $domain = "$sub_domain(\\x2e$sub_domain)*";
  $local_part = "$word(\\x2e$word)*";
  $addr_spec = "$local_part\\x40$domain";

  return preg_match("!^$addr_spec$!", $email) ? 1 : 0;
}

The ‘\x##’ sequences are hexadecimal character references. It’s just a fancy way of specifying a character using it’s underlying code point (the numerical representation of a particular symbol). Otherwise this is a fairly straightforward, albeit incredibly complex regular expression. I’ll refrain from any further analysis since it’s been done elsewhere.

Tim Fletcher has ported Cal’s original PHP function into Ruby and Perl, if that’s what you’re into..

Matching a URL

Matching a URL is a lot like matching an e-mail address, except that you tend to do it in more controlled, and less critical situations where you can tolerate a few false positives. I use this regex frequently in projects when I need to automatically generate links when a URL is typed in a comment field, for example. Like the email regex, this one’s a doozy, but it’s pretty easy to understand.

I’ve taken advantage of the ‘x’ and ‘i’ pattern modifiers for this regex. Pattern modifiers are tacked onto the end of a regex and change the way the regex engine interprets the expression. The ‘x’ modifier tells the engine to ignore whitespace, except when escaped or used inside of a character class. It also tells the engine to interpret any text following a ‘#’ character outside of a character class as a comment (i.e. ignore it). The ‘i’ modifier makes the regex case insensitive. This can drastically simplify a complicated regex like this one when case doesn’t matter. This regex is derived from one developed by Jeffrey Friedl in his book Mastering Regular Expressions.

{
  \\b
  # Match the leading part (proto://hostname, or just hostname)
  (
    # http://, or https:// leading part
    (https?)://[-\\w]+(\\.\\w[-\\w]*)+
  |
    # or, try to find a hostname with more specific sub-expression
    (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \\. )+ # sub domains
    # Now ending .com, etc. For these, require lowercase
    (?-i: com\\b
        | edu\\b
        | biz\\b
        | gov\\b
        | in(?:t|fo)\\b # .int or .info
        | mil\\b
        | net\\b
        | org\\b
        | [a-z][a-z]\\.[a-z][a-z]\\b # two-letter country code
    )
  )

  # Allow an optional port number
  ( : \\d+ )?

  # The rest of the URL is optional, and begins with /
  (
    /
    # The rest are heuristics for what seems to work well
    [^.!,?;"\\'<>()\[\]\{\}\s\x7F-\\xFF]*
    (
      [.!,?]+ [^.!,?;”\\’<>()\\[\\]\{\\}\s\\x7F-\\xFF]+
    )*
  )?
}ix

The comments in this expression are fairly self explanatory, so I don’t think it needs a whole lot of explanation. There are a few things to watch out for though. First, this regex will match some things that are not valid URLs. The regex assumes that any two-letter combination is a valid top-level domain (TLD), which is not the case. It also misses TLDs that were recently added to the IANA list like .travel, .name, and .museum. You can fix this by downloading the latest IANA TLD list and adding any missing TLDs in the list of alternatives mid-way through the expression.

That being said, this regex works great 99.9% of the time. Here’s a quick PHP function that will parse a section of text, replacing any URLs it finds with links. I’m going to assume you’ve set the variable $url_regex to the the above pattern so I don’t have to repeat it here.

function auto_link( $text ) {
  $url_regex = ...

  return preg_replace( $url_regex,
                         '<a href="$0"^gt;$0=</a>',
                         $text );
}

So that’s it. If you think I left something off the list that deserves mention, or if you have any suggestions for improvements, post a comment and let me know.

The absolute bare minimum every programmer should know about regular expressions

Programming, Tutorials 44 Comments »

Wtf is a regular expression?

Regular expressions are strings formatted using a special pattern notation that allow you to describe and parse text. Many programmers (even some good ones) disregard regular expressions as line noise, and it’s a damned shame because they come in handy so often. Once you’ve gotten the hang of them, regular expressions can be used to solve countless real world problems.

Regular expressions work a lot like the filename “globs” in Windows or *nix that allow you to specify a number of files by using the special * or ? characters (oops, did I just use a glob while defining a glob?). But with regular expressions the special characters, or metacharacters, are far more expressive.

Like globs, regular expressions treat most characters as literal text. The regular expression mike, for example, will only match the letters m - i - k - e, in that order. But regular expressions sport an extensive set of metacharacters that make the simple glob look downright primitive.

Meet the metacharacters: ^[](){}.*?\|+$ and sometimes -

I know, they look intimidating, but they’re really nice characters once you get to know them.

The Line Anchors: ‘^’ and ‘$’

The ‘^’ (caret) and ‘$’ (dollar) metacharacters represent the start and end of a line of text, respectively. As I mentioned earlier, the regular expression mike will match the letters m - i - k - e, but it will match anywhere in a line (e.g. it would match “I’m mike”, or even “carmike”). The ‘^’ character is used to anchor the match to the start of the line, so ^mike would only find lines that start with mike. Similarly, the expression mike$ would only find m - i - k - e at the end of a line (but would still match ‘carmike’).

If we combine the two line anchor characters we can search for lines of text that contain a specific sequence of characters. The expression ^mike$ will only match the word mike on a line by itself - nothing more, nothing less. Similarly the expression ^$ is useful for finding empty lines, where the beginning of the line is promptly followed by the end.

The Character Class: ‘[]’

Square brackets, called a character class, let you match any one of several characters. Suppose you want to match the word ‘gray’, but also want to find it if it was spelled ‘grey’. A character class will allow you to match either. The regular expression gr[ea]y is interpreted as “g, followed by r, followed by either an e or an a, followed by y.”

If you use [^ ... ] instead of [ ... ], the class matches any character that isn’t listed. The leading ^ “negates” the list. Instead of listing all of the characters you want to included in the class, you list the characters you don’t want included. Note that the ^ (caret) character used here has a different meaning when it’s used outside of a character class, where it is used to match the beginning of a line.

The Character Class Metacharacter: ‘-’

Within a character-class, the character-class metacharacter ‘-’ (dash) indicates a range of characters. Instead of [01234567890abcdefABCDEF] we can write [0-9a-fA-F]. How convenient. The dash is a metacharacter only within a character class, elsewhere it simply matches the normal dash character.

But wait, there’s a catch. If a dash is the first character in a character class it is not considered a metacharacter (it can’t possibly represent a range, since a range requires a beginning and an end), and will match a normal dash character. Similarly, the question mark and period are usually regex metacharacters, but not when they’re inside a class (in the class [-0-9.?] the only special character is the dash between the 0 and 9).

Matching Any Character With a Dot: ‘.’

The ‘.’ metacharacter (called a dot or point) is shorthand for a character class that matches any character. It’s very convenient when you want to match any character at a particular position in a string. Once again, the dot metacharacter is not a metacharacter when it’s inside of a character class. Are you beginning to see a pattern? The list of metacharacters is different inside and outside of a character class.

The Alternation Metacharacter: ‘|’

The ‘|’ metacharacter, (pipe) means “or.” It allows you to combine multiple expressions into a single expression that matches any of the individual ones. The subexpressions are called alternatives.

For example, Mike and Michael are separate expressions, but Mike|Michael is one expression that matches either.

Parenthesis can be used to limit the scope of the alternatives. I could shorten our previous expression that matched Mike or Michael with creative use of parenthesis. The expression Mi(ke|chael) matches the same thing. I probably would use the first expression in practice, however, as it is more legible and therefore more maintainable.

Matching Optional Items: ‘?’

The ‘?’ metacharacter (question mark) means optional. It is placed after a character that is allowed, but not required, at a certain point in an expression. The question mark attaches only to the immediately preceding character.

If I wanted to match the english or american versions of the word ‘flavor’ I could use the regex flavou?r, which is interpreted as “f, followed by l, followed by a, followed by v, followed by o, followed by an optional u, followed by r.”

The Other Quantifiers: ‘+’ and ‘*’

Like the question mark, the ‘+’ (plus) and ‘*’ (star) metacharacters affect the number of times the preceding character can appear in the expression (with ‘?’ the preceding character could appear 0 or 1 times). The metacharacter ‘+’ matches one or more of the immediately preceding item, while ‘*’ matches any number of the preceding item, including 0.

If I was trying to determine the score of a soccer match by counting the number of times the announcer said the word ‘goal’ in a transcript, I might use the regular expression go+al, which would match ‘goal’, as well as ‘gooooooooooooooooal’ (but not ‘gal’).

The three metacharacters, question mark, plus, and star are called quantifiers because they influence the quantity of the item they’re attached to.

The Interval Quantifier: ‘{}’

The ‘{min, max}’ metasequence allows you to specify the number of times a particular item can match by providing your own minimum and maximum. The regex go{1,5}al would limit our previous example to matching between one and five o’s. The sequence {0,1} is identical to a question mark.

The Escape Character: ‘\’

The ‘\’ metacharacter (backslash) is used to escape metacharacters that have special meaning so you can match them in patterns. For example, if you would like to match the ‘?’ or ‘\’ characters, you can precede them with a backslash, which removes their meaning: ‘\?’ or ‘\\’.

When used before a non-metacharacter a backslash can have a different meaning depending on the flavor of regular expression you’re using. For perl compatible regular expressions (PCREs) you can check out the perldoc page for perl regular expressions. PCREs are extremely common, this flavor of regexes can be used in PHP, Ruby, and ECMAScript/Javascript, and many other languages.

Using Parenthesis for Matching: ‘()’

Most regular expression tools will allow you to capture a particular subset of an expression with parenthesis. I could match the domain portion of a URL by using an expression like http://([^/]+). Let’s break this expression down into it’s components to see how it works.

The beginning of the expression is fairly straightforward: it matches the sequence “h - t - t - p - : - / - /”. This initial sequence is followed by parenthesis, which are used to capture the characters that match the subexpression they surround. In this case the subexpression is ‘[^/]+’, which matches any character except for ‘/’ one or more times. For a URL like http://immike.net/blog/Some-blog-post, ‘immike.net’ will be captured by the parenthesis.

Want to know more?

I’ve only touched the surface on what can be done with regular expressions. If want to learn more, check out Jeffrey Friedl’s book Mastering Regular Expressions. Friedl did an outstanding job with this book, it’s accessible and a surprisingly fun and interesting read, considering the dry subject matter.

Check out my follow up to this article where I take a look at some of the most useful regular expressions for common programming tasks. And once you understand the basics read on to learn all you need to know to become a regex pro.

Copyright © 2007 - Mike Malone / Icons by N.Design Studio
Entries RSS Comments RSS Log in
no image