Hacking Google Spell Checker for Fun and Profit
Google, Hacks, Programming - April 7th, 2007
A few days ago I was researching ways to integrate spell checking into the search engine for a project I’m working on, similar to the way Google does it. I figured Google, being Google, must have some legitimate mechanism for accessing their spell checker (this is Web 2.0, after all).
After scouring the Internet for some time, all I could find was a deprecated SOAP web service that used to be available as part of their SOAP Search API. Unfortunately, they stopped issuing API keys for the SOAP Search API on December 5, 2006. The AJAX Search API that replaced it doesn’t seem to provide spelling corrections. Bummer.
Just as I was about to give up I stumbled across an interesting blog post that describes a publicly available (but undocumented and apparently not very widely known) RPC endpoint that Google uses to provide spelling corrections for the Google Toolbar. The URL is https://www.google.com/tbproxy/spell.
Neat. After a few minutes of tinkering I put together a small class in PHP that provides easy access to the service. The class requires SimpleXML and CURL. It defines two static methods, SpellChecker::Check() (which returns true if the query you pass as an argument is spelled correctly) and SpellChecker::Correct() (which returns Google’s suggested spelling). You can download the source here (plaintext version), or try it out with the AJAX spell checker I threw together (up top).
Here’s a quick replay of a typical request/response (I wrapped the XML, but in theory it shouldn’t matter):
POST /tbproxy/spell?lang=en&hl=en HTTP/1.0
MIME-Version: 1.0
Content-type: application/PTI26
Content-length: 125
Content-transfer-encoding: text
Request-number: 1
Document-type: Request
Interface-Version: Test 1.4
Connection: close

<spellrequest
textalreadyclipped="0"
ignoredups="1"
ignoredigits="1"
ignoreallcaps="0">
<text>gogle spel</text>
</spellrequest>

HTTP/1.0 200 OK
Content-Type: text/xml
Server: DocumentSpellcheck
Cache-Control: private, x-gzip-ok=""
Date: Sat, 07 Apr 2007 14:11:57 GMT
Connection: Close

<?xml version="1.0"?>
<spellresult
error="0"
clipped="0"
charschecked="10">
<c o="0" l="5" s="1">google Google goggle giggle Gogol</c>
<c o="6" l="4" s="1">spell spiel spelt spew Opel</c>
</spellresult>
The suggestions are tab-delimited. The ‘o’ attribute is the offset from the start of your query to the misspelled word, ‘l’ is the length of the misspelled word, and ‘s’ is the confidence of Google’s suggestion (presumably higher is better, but I’ve only gotten 0 or 1).
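To make the response format concrete, here is a minimal sketch of reading a spellresult document with SimpleXML and substituting the first suggestion for each flagged word. The function names parseSpellResult() and applyTopSuggestions() are hypothetical and are not taken from SpellChecker.php. The sketch also assumes the offsets count single-byte characters, which holds for plain ASCII queries but may not for multi-byte text.

```php
<?php
// Hypothetical helpers for illustration; not part of SpellChecker.php.

// Turn a <spellresult> document into one array per <c> element.
function parseSpellResult($xml)
{
    $result = simplexml_load_string($xml);
    if ($result === false) {
        return array(); // unparseable response
    }
    $corrections = array();
    foreach ($result->c as $c) {
        $corrections[] = array(
            'offset'      => (int) $c['o'],
            'length'      => (int) $c['l'],
            'confidence'  => (int) $c['s'],
            // Suggestions are separated by tab characters.
            'suggestions' => explode("\t", trim((string) $c)),
        );
    }
    return $corrections;
}

// Replace each flagged word with its first suggestion.
function applyTopSuggestions($query, array $corrections)
{
    // Work from the end of the string so earlier offsets stay valid.
    usort($corrections, function ($a, $b) {
        return $b['offset'] - $a['offset'];
    });
    foreach ($corrections as $c) {
        if ($c['suggestions'][0] !== '') {
            // Assumption: offsets and lengths are byte counts (ASCII input).
            $query = substr_replace($query, $c['suggestions'][0], $c['offset'], $c['length']);
        }
    }
    return $query;
}

// The response from the replay above, with real tabs between suggestions.
$sample = '<?xml version="1.0"?>'
    . '<spellresult error="0" clipped="0" charschecked="10">'
    . "<c o=\"0\" l=\"5\" s=\"1\">google\tGoogle\tgoggle\tgiggle\tGogol</c>"
    . "<c o=\"6\" l=\"4\" s=\"1\">spell\tspiel\tspelt\tspew\tOpel</c>"
    . '</spellresult>';

echo applyTopSuggestions('gogle spel', parseSpellResult($sample)), "\n"; // prints "google spell"
```

Replacing from the last offset backwards matters because a suggestion can be longer or shorter than the word it replaces, which would shift every offset that follows it.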
April 16th, 2007 at 11:14 pm
Very cool indeed! I have an old developer API key from their deprecated SOAP service but that limits you to 1000 queries per day. I’m surprised that this feature is “under wraps” since spell checking is such an important feature in any web application. DRY - Don’t repeat yourself!
April 19th, 2007 at 1:18 am
Very cool. Thanks for the two XML examples (the post and what is returned…those were the two missing pieces that I needed). Excellent post…and I’m using Firefox with the Google tool bar installed, and I notice it is spell checking this form field as we speak.
April 22nd, 2007 at 2:04 am
Cool stuff!! The example is excellent. Could you please post the JavaScript code you used in the demo on your web page?
Thanks
May 2nd, 2007 at 9:26 pm
Hi, thanks for posting this code, it’s cool. I’m a beginner in PHP. Can you post a PHP example file or a JavaScript example?
Thanks a lot
May 2nd, 2007 at 9:33 pm
You can download the JavaScript source for the form on this page at http://immike.net/js/check_spelling.js. It uses the jQuery library. The PHP code is only a few lines, so here it is:
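include('SpellChecker.php');
// Read the query parameter, defaulting to an empty string if it is missing.
$query = isset($_GET['q']) ? $_GET['q'] : '';
if (SpellChecker::Check($query)) {
    // The query is spelled correctly.
    echo '~~correct~~';
}
else {
    // Misspelled: append Google's suggested spelling.
    echo '~~incorrect~~' . SpellChecker::Correct($query);
}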
May 3rd, 2007 at 2:02 pm
Hi again Mike, thanks a lot for your post. I have a little problem. I get this error in SpellChecker.php:
Parse error: parse error, expecting `T_OLD_FUNCTION' or `T_FUNCTION' or `T_VAR' or `'}'' in c:\www\ecommerce\SpellChecker.php on line 16
Can you tell me what’s wrong?
I use on my system:
-PHP4
-Apache for Windows
Looking at your code… do I need to install cURL?
NOTE: Can I show you my gratitude with a donation?
May 3rd, 2007 at 2:10 pm
Looks like you might need to change the types on the variable declarations in the SpellChecker class. Instead of “private static $variable” try changing it to “var $variable”. That should do the trick.
You might also have to take the scope specifications off of the functions. Instead of “public static function getInstance()” use just “function getInstance()”.
It’s been a while since I’ve used PHP4, and I don’t have any boxes left running it, so I can’t test these changes. Let me know how things go. Don’t worry about a donation, just glad I could help :).
May 3rd, 2007 at 3:36 pm
Hi again:
I made the changes but I have other problems:
Parse error: parse error, expecting `'('' in c:\www\ecommerce\SpellChecker.php on line 23
I uncommented php_curl.dll in php.ini, but the system can’t load it.
Can you help me with another alternative?
Thanks a lot
May 4th, 2007 at 9:30 am
Hi again, can you tell me if your code is limited to 1000 queries per day?
This is my messenger address: sebastian.jerez@hotmail.com. I need your help, please.
May 4th, 2007 at 11:30 am
Hey Sebastian,
As far as I know there are no limits on the number of requests per day that can be made. Since the RPC endpoint doesn’t use any sort of token identification it would be difficult for Google to implement a limiting mechanism (other than by IP address, which is generally a Really Bad Idea), so I kind of doubt there’s a limit.
That said, they would probably start to notice at some point. Sending hundreds of queries per second probably isn’t a great idea. I’d imagine 1,000 per day would be practically imperceptible given the amount of traffic Google handles daily.
May 4th, 2007 at 12:20 pm
Hi again, now I know that my problem is that I’m using PHP4 and your class is for PHP5. Is it possible for you to convert your class for PHP4 usage? Or how can I convert it to PHP4?
May 4th, 2007 at 2:00 pm
Try removing the scope declarations from the variables (private/public/static) and declare them as “var $variable” instead. Also remove the scope declarations for functions. You’ll also need to use an alternative to SimpleXML for the XML parsing.
Honestly, I’d recommend upgrading to PHP5. PHP4 is several years old now and I think a lot of people are getting tired of supporting it.
May 12th, 2007 at 8:35 am
Sebastian, to implement that class in PHP4:
private static $instance;
private static $_cache;
private function __construct( ) {}
becomes:
var $instance;
var $_cache;
function SpellChecker() {}
__construct and __destruct are not part of PHP4.
Instead of using self::$instance, use $this->instance,
and remove every “static”, “private”, “public” or “protected” declaration.
And if I remember correctly, exceptions (try/catch blocks) are not part of PHP4 either.
Hope this helps!
Thanks for that great class!
May 14th, 2007 at 4:25 am
It’s fun.
I tried “jsaen”, expecting “jeans” to be given, but Google sent me “jason”… which is the alternative word, I bet.
Web Oriented Results ? :))
May 14th, 2007 at 9:18 pm
Yea, sometimes the results are pretty interesting. Google actually sends multiple suggestions, so you could offer several alternatives if you were building a spell-correction system.
For some more info on how the Google spell correction algorithm works, check out this article that explains how you can implement your own version (in Python, and in 21 lines of code). Pretty cool.
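As a rough sketch of offering several alternatives, the hypothetical helper below (not part of SpellChecker.php) builds one candidate query per suggestion for a single flagged word. It uses the offset, length and suggestion list from the “spel” example in the post, and assumes a plain ASCII query so that byte offsets and character offsets agree.

```php
<?php
// Hypothetical helper for illustration; not part of SpellChecker.php.
// Returns one alternative query per suggestion for a single flagged word.
function alternativesForWord($query, $offset, $length, array $suggestions)
{
    $alternatives = array();
    foreach ($suggestions as $suggestion) {
        // Swap the flagged word for this suggestion, leaving the rest alone.
        $alternatives[] = substr_replace($query, $suggestion, $offset, $length);
    }
    return $alternatives;
}

// Values taken from the <c o="6" l="4"> element in the example response.
$suggestions = explode("\t", "spell\tspiel\tspelt\tspew\tOpel");
foreach (alternativesForWord('gogle spel', 6, 4, $suggestions) as $alternative) {
    echo $alternative, "\n";
}
```

Each line printed is the original query with only the second word replaced, so a “did you mean” list could show them in the order Google returned them.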
June 4th, 2007 at 11:23 am
Thanks for the code! Works great!
June 4th, 2007 at 3:01 pm
I have a problem with line 39 in SpellChecker.php:
39. return( strcasecmp($query, self::Correct($query, $lang, $hl)) === 0 );
ERROR:
Fatal error: Undefined class name 'self' in c:\phpdev5\www\SpellChecker.php on line 39
Is there any equivalent code for PHP4?
Thanks,
Sebastian
June 4th, 2007 at 3:21 pm
Yea, replace “self” with the classname:
return( strcasecmp($query, SpellChecker::Correct($query, $lang, $hl)) === 0 );
June 4th, 2007 at 3:46 pm
Don’t worry. Now I have a problem with SimpleXML, so I need to upgrade to PHP5. Thanks.
June 4th, 2007 at 3:49 pm
It’s really worth the upgrade if you can do it. There are so many new features that make programming much more pleasant.
June 6th, 2007 at 8:58 am
I got this to work on one web server, but it will not work on another. The web server that it does not work on is running PHP5. I know cURL is enabled because I have scripts that use cURL. The only thing I can think of is that there might be an XML extension that is missing. Any ideas? Can you see if anything is missing from here?
[root@www /usr/local/etc/php]# more extensions.ini
extension=bz2.so
extension=calendar.so
extension=ctype.so
extension=curl.so
extension=dom.so
extension=ftp.so
extension=gd.so
extension=iconv.so
extension=mcrypt.so
extension=mysqli.so
extension=ncurses.so
extension=pcre.so
extension=zlib.so
extension=pdo.so
extension=posix.so
extension=pspell.so
extension=readline.so
extension=session.so
extension=simplexml.so
extension=soap.so
extension=sockets.so
extension=sqlite.so
extension=tokenizer.so
extension=xml.so
extension=xmlreader.so
extension=xmlwriter.so
extension=xsl.so
extension=zip.so
extension=mysql.so
extension=openssl.so
extension=pdf.so
extension=mbstring.so
Thanks.
June 7th, 2007 at 4:01 pm
Hi Mike, I upgraded to PHP5 and now the code runs OK, but now I get an error, “Certificate SSL problem”. I run the code on my local machine (Apache for Windows). I know that isn’t your problem, but maybe you know why I get this error. Thanks for this code.
June 8th, 2007 at 9:04 am
Fixed the CURL problem in the previous post. I temporarily set error reporting to print to screen in php.ini. Ran the script again and determined it was a problem with permissions on the directory that holds the CA certs. I did a chmod on the directory and it works great now! Found a link that tells you how to correct this problem in two different ways. http://curl.haxx.se/mail/curlphp-2005-11/0038.html. Hope this saves someone some time.
June 8th, 2007 at 9:09 am
Sorry about the link above. I added a period at the end. This one works.
http://curl.haxx.se/mail/curlphp-2005-11/0038.html
June 21st, 2007 at 7:59 pm
@Sebastian: I haven’t done much testing on Windows, but I found a link that may help: http://curl.haxx.se/mail/archive-2006-05/0034.html
Looks like the SSL certificate is signed by a CA that cURL doesn’t like. Try adding the following line of code around line 113 to 117 (where all of the curl_setopt() lines are):
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
June 22nd, 2007 at 12:48 pm
Yuhuuuuu!!! Done! Thanks so much
June 22nd, 2007 at 1:00 pm
Hi again, I have a small question:
Why, when I try this string:
“Crysler Sebring”
Google search returns: “Chrysler Sebring” (correct)
and your code (Google based) returns: “Chysler sobering” (incorrect)? Thanks a lot
June 22nd, 2007 at 1:52 pm
And this string, “Nokia 6130”, returns “Niki 6130”. What’s wrong?
August 30th, 2007 at 6:55 am
Even if the thread is old,
Since i see a discussion around spell checkers, thought this link would be helpful.
http://norvig.com/spell-correct.html
November 2nd, 2007 at 10:00 pm
Hello,
I have downloaded SpellChecker.php, check_spelling.js, and also the jQuery library. I have created the following code in a PHP page for testing:
Try it out!
When I type anything in the search box and then press the “check” button, it says “Correct spelling: undefined”. Please help. Thanks.
February 1st, 2008 at 8:16 am
I am very new to PHP. Can I have a full working copy of the Google-style spell checker in PHP, with all files as a zip, and the steps I have to follow to get the code working?
February 26th, 2008 at 5:22 am
Hi Mike,
I am new to PHP. So can you please send me the full code for “Hacking Google Spell Checker for Fun and Profit”? Thank you very much…
March 13th, 2008 at 2:44 pm
Hi,
I wrote a Java program for the HTTP post to the Google spellchecker. It gives me results:
Do you have any ideas why?
March 29th, 2008 at 7:52 pm
Hey Mike, thanks for the info. Here’s a Java version:
http://www.gmacker.com/web/content/tutorial/googlespellchecker/googlespellchecker.htm
May 13th, 2008 at 1:44 pm
Hail Mike, I was creating an online word reminder where people (kids) can store the words which they would like to be reminded of often. I thought of using a spell check there and I got here… I had all the problems which Sebastian had in the above queries, and I solved them by reading this page from points 1 to 30, I think… then it worked… I have to implement it in my site… but one thing is IT IS WORKING WITHOUT SIMPLEXML by just using cURL… I am using PHP4… so hats off… If someone is willing to see my site where this works, please ping me and I will give you the link… if an email ID is needed, it’s in my name… enjoy
August 8th, 2008 at 9:00 pm
Hi Mike, I tried to use your script, but the result is always “correct” (even if I write a completely misspelled word). What could it be?
August 13th, 2008 at 3:10 pm
Hey Mike, all of a sudden this has stopped working. I’ve traced it to the ignoredups= attribute being sent. If it’s set to 1, as it is by default in your class, Google always reports that there are no errors. If you set this to 0 or remove the attribute, it works fine.
It affects the Google Toolbar as well, with the same setting (“Ignore Unknown Words Appearing Many Times”). Don’t suppose you know if something permanently changed in the API or if it’s just randomly been broken for the past couple of weeks?
http://groups.google.com/group/IEToolbar-Group-Bugs/browse_thread/thread/445c5e66690261e3
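For anyone working around the ignoredups behaviour described above, here is a minimal sketch of building the request body with that attribute switched off. The function buildSpellRequest() is hypothetical and is not how SpellChecker.php is written. The attribute names come from the request replay in the post, and the default of 0 for ignoredups is an assumption based only on the report in this comment.

```php
<?php
// Hypothetical helper for illustration; not part of SpellChecker.php.
// Builds the <spellrequest> XML body sent to /tbproxy/spell.
function buildSpellRequest($text, array $options = array())
{
    $defaults = array(
        'textalreadyclipped' => 0,
        'ignoredups'         => 0, // assumption: 1 reportedly hides all errors
        'ignoredigits'       => 1,
        'ignoreallcaps'      => 0,
    );
    // Only accept overrides for attributes that appear in the original request.
    $settings = array_merge($defaults, array_intersect_key($options, $defaults));

    $attributes = '';
    foreach ($settings as $name => $value) {
        $attributes .= sprintf(' %s="%d"', $name, $value ? 1 : 0);
    }

    // Escape the query so characters like & and < cannot break the XML.
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');

    return '<spellrequest' . $attributes . '><text>' . $escaped . '</text></spellrequest>';
}

echo buildSpellRequest('gogle spel'), "\n";
// prints <spellrequest textalreadyclipped="0" ignoredups="0" ignoredigits="1" ignoreallcaps="0"><text>gogle spel</text></spellrequest>
```

Passing array('ignoredups' => 1) restores the value shown in the original request, which makes it easy to compare the two behaviours side by side.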