Full text search with Apache Lucene
Software, Tutorials, Web Development - July 3rd, 2007
It’s rather ironic that, while search is nearly ubiquitous on the web, there is no perfect solution for adding search functionality to a web application. Many developers simply use the basic search functionality built into whatever database server they’re using. Until recently, systems that required a more feature-rich, efficient, or flexible search solution had to turn to proprietary commercial software. But this is no longer the case. Apache’s Lucene project has brought the open source community a sophisticated and flexible search solution that rivals most commercial packages.
Apache Lucene is a high-performance, feature-rich text search engine written in Java. It’s a cross-platform technology that is suitable for nearly any application that requires full-text search. If Java isn’t your language of choice, no worries: a sub-project called Solr wraps Lucene in a web service layer, making it simple to use from any language.
Getting started with Solr
In the remainder of this post I’ll walk through setting up a Lucene search solution using Solr. Getting Solr up and running on your system is easy. First, make sure you have Java installed, then follow these simple steps:
- Grab the latest release of Solr from an Apache mirror site (I’m using version 1.2)
- Extract the archive (unzip apache-solr-1.2.0.zip or tar -zxvf apache-solr-1.2.0.tgz)
- Launch Jetty with the Solr WAR and the example configs:
cd apache-solr-1.2.0/example; java -jar start.jar
That’s it. Solr is now up and running on your system, listening on port 8983. If you visit http://localhost:8983/solr/admin you should see the Solr admin panel.
A real world example
To demonstrate how Lucene works, we’ll use the Digg API to index the last 50,000 or so stories submitted to digg.com. Start by creating a copy of the example directory called digg (cp -r example digg), then change to the newly created directory.
Like most Java applications, Solr stores its configuration in XML files. Most of the default options are fine, but we’ll need to change the example schema so that Lucene can properly index data from Digg. The schema.xml file is located in the ./solr/conf/ subdirectory, and specifies the fields our documents contain.
The <fields> section is where you list individual <field> elements. Each <field> has a name that you use to reference it when adding documents or executing searches, and a type that identifies the field’s data type. Data types are declared using <fieldType> elements. The standard types in the example configuration (text, boolean, sint, slong, sfloat, sdouble, among others) are sufficient for most applications.
To index Digg submissions, we need to store a number of fields, including the article description, title, link, topic, user, and status. For each field you can specify whether it should be indexed (so that it can be used in searches) and/or stored (so that you can retrieve its value directly from Lucene). If your documents are stored in a database, you may choose to have Lucene index some fields, but not store them. As long as Lucene stores each document’s primary key, the document can be reconstructed directly from the database. This reduces Lucene’s storage requirements, and may help prevent some data anomalies.
I’ve added the following fields to index and store documents from Digg:
<field name="id" type="string" indexed="true" stored="true"
required="true"/>
<field name="href" type="text" indexed="true" stored="true"
required="true"/>
<field name="link" type="text" indexed="true" stored="true"
required="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true"
stored="true"/>
<field name="status" type="text" indexed="true" stored="true"/>
<field name="user" type="text" indexed="true" stored="true"/>
<field name="topic" type="text" indexed="true" stored="true"/>
<field name="container" type="text" indexed="true" stored="true"/>
<field name="diggs" type="sint" indexed="true" stored="true"
default="0"/>
<field name="comments" type="sint" indexed="true" stored="true"
default="0"/>
Queries sent to Lucene will, by default, search a single field. This field is specified using the <defaultSearchField> element. Since we want to search multiple fields (description, title, user, etc.), we’ll set the default search field to a catchall text field called text, then copy each field we want searched into it using the <copyField> element.
Any number of <copyField> declarations can be included in a schema. They instruct Solr to copy data from the field specified by the source attribute to the field specified in the dest attribute. We’ll copy each document’s title, description, user, topic, and container into our catchall ‘text’ field.
<copyField source="title" dest="text"/>
<copyField source="description" dest="text"/>
<copyField source="user" dest="text"/>
<copyField source="topic" dest="text"/>
<copyField source="container" dest="text"/>
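To make the effect of these declarations concrete, here is a toy model in Python of what copying does to a single document. It is only an illustration of the idea, not how Solr implements it; the function name apply_copy_fields and the rule list are made up for this sketch, and it assumes the incoming document does not set the destination field itself.

```python
def apply_copy_fields(doc, rules):
    """Return a copy of doc with each rule's source value appended to its dest field.

    doc   -- dict mapping field names to values
    rules -- list of (source, dest) pairs, mirroring <copyField source=... dest=.../>
    Assumes doc does not already contain the dest field as a plain string.
    """
    result = dict(doc)
    for source, dest in rules:
        if source in doc:
            # The destination collects one value per copied source field.
            result[dest] = result.get(dest, []) + [doc[source]]
    return result


# Hypothetical rules matching the declarations above.
COPY_RULES = [
    ("title", "text"),
    ("description", "text"),
    ("user", "text"),
    ("topic", "text"),
    ("container", "text"),
]
```

A document with only a title and a user would end up with a text field holding those two values, which is why a search against text matches either one.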
That’s pretty much it for the schema. You can download my schema.xml file here if you want to play along.
Solr is now configured and ready to index documents. Again, to start Solr, run java -jar start.jar from the digg directory we created earlier. As long as there are no mistakes, Jetty should launch the Solr WAR with our new configuration.
Getting documents into Solr
Our Solr server is up and running, but it doesn’t contain any data. You interact with Solr programmatically using a RESTful web service API. You can modify a Solr index by POSTing XML documents containing instructions to add or delete documents, commit pending adds and deletes, and optimize your index. An example command looks like this:
<add>
<doc>
<field name="id">2321564</field>
<field name="link">http://immike.net/...</field>
<field name="href">http://digg.com/...</field>
<field name="status">upcoming</field>
<field name="title">Article Title</field>
<field name="description">Article Description...</field>
<field name="user">mmalone</field>
<field name="topic">Offbeat News</field>
<field name="container">World &amp; Business</field>
<field name="diggs">2</field>
</doc>
</add>
There are a number of libraries that allow you to easily interact with Solr from various programming languages. The SolrUpdate PHP client lets you add an array of documents (represented as associative arrays with field names as keys) to a Solr index and commit them in a few lines of code. For an example of how SolrUpdate works, check out this script for indexing Digg submissions.
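If you would rather not depend on a client library, the add command shown above can be built and posted by hand. The following Python sketch uses only the standard library. It assumes the update handler lives at /solr/update on the example Jetty server, which is not stated above, so adjust the URL for your install. The helper names build_add_xml and post_xml are my own.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed update endpoint for the example Jetty setup; change it if yours differs.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_xml(docs):
    """Serialize a list of dicts (field name -> value) as a Solr <add> command."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


def post_xml(xml, url=UPDATE_URL):
    """POST an XML command to Solr and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With Solr running, calling post_xml(build_add_xml(stories)) and then post_xml("<commit/>") should add the documents and make them searchable. Remember that every document needs the fields marked required in the schema (id, href, and link).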
Querying your Solr index
Searches are performed using HTTP GET requests at the /solr/select URL of your Solr webserver. The query is passed in the q parameter. A number of request parameters can be used to control what information is returned. A typical query might look something like this: http://localhost:8983/solr/select?q=iphone.
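As a rough sketch, the request described above can be assembled in Python like this. The start and rows parameters control paging, and fl limits which stored fields come back. The function name build_select_url is hypothetical, and the defaults are just the values I use here, not anything Solr requires.

```python
from urllib.parse import urlencode

# Select endpoint of the example server used in this post.
SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(query, start=0, rows=10, fields=None, base=SELECT_URL):
    """Build a Solr select URL for the given query string.

    start/rows page through the results; fields (optional) becomes the
    comma-separated fl parameter that restricts the returned fields.
    """
    params = {"q": query, "start": start, "rows": rows}
    if fields:
        params["fl"] = ",".join(fields)
    # urlencode percent-escapes characters such as ':' and encodes spaces as '+'.
    return base + "?" + urlencode(params)
```

Letting urlencode do the escaping matters once queries contain spaces or field prefixes such as title:iphone.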
Solr returns search results as XML, so you can manipulate them however you like. You can even query Solr directly using an XMLHttpRequest, and manipulate the results using JavaScript. A typical query result looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="rows">10</str>
      <str name="start">0</str>
      <str name="q">iphone</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="2891" start="0">
    <doc>
      <int name="comments">0</int>
      <str name="container">Technology</str>
      <str name="description">Video of iphone unboxing.</str>
      <int name="diggs">2</int>
      <str name="href">http://digg.com/apple/Video_of_iphone_unboxing</str>
      <str name="id">2377017</str>
      <str name="link">http://www.iphonematters.com/article/video_of_iphone_unboxing_from_trunk_of_car/</str>
      <str name="status">upcoming</str>
      <date name="timestamp">2007-06-30T22:30:39.408Z</date>
      <str name="title">Video of iphone unboxing</str>
      <str name="topic">Apple</str>
      <str name="user">titlesaysitall</str>
    </doc>
    ...
</response>
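A response in this shape is easy to turn into plain data structures. Here is a minimal Python sketch using the standard library’s ElementTree. The function name parse_response is my own, and it only handles the element types that appear in the result above (plus arr, which Solr uses for multi-valued fields); everything other than int is left as a string.

```python
import xml.etree.ElementTree as ET


def parse_response(xml_text):
    """Return (numFound, docs) from a Solr XML response.

    docs is a list of dicts mapping field names to values. <int> values are
    converted to int, <arr> values become lists of strings, and all other
    types (str, date, ...) are kept as strings.
    """
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc_el in result.findall("doc"):
        doc = {}
        for field in doc_el:
            if field.tag == "arr":
                value = [child.text or "" for child in field]
            elif field.tag == "int":
                value = int(field.text)
            else:
                value = field.text or ""
            doc[field.get("name")] = value
        docs.append(doc)
    return int(result.get("numFound")), docs
```

Note that numFound is the total number of matches in the index, while docs holds only the page of results that was requested.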
More Information
Lucene and Solr are highly customizable, and I’ve only touched on the most basic functionality of each application. Additional information can be found on the Solr website and the Lucene website. In particular, you should read the Solr tutorial, which walks you through setting up the example search index distributed with Solr. There is also a Solr Wiki and a Lucene Wiki, both of which are great resources if you’re interested in learning more about what these applications have to offer.
July 5th, 2007 at 9:01 pm
Looks pretty simple to do.
DBSight is pretty similar to SOLR, except it focuses on database. You can easily flatten 1:m relationships, synchronize with the database updates, without any coding.
You are very welcome to try it out also.
July 5th, 2007 at 9:11 pm
Looks interesting Chris. I’ll have to try it out, but it looks like it solves a lot of the data duplication / anomaly problems that can occur with Lucene. Seems like it basically automates & standardizes what a lot of people are already doing with custom scripts - synching a relational database with a Lucene index.
July 5th, 2007 at 9:20 pm
Exactly. DBSight makes creating a database search super easy and less error-prone. You can quickly create a search and consume the search in either XML/JSON/HTML by other languages, like LAMP applications.
July 7th, 2007 at 2:15 am
Pretty cool stuff. I love the Digg reference. That made the whole article for me.
July 7th, 2007 at 2:21 am
Haha, glad to hear it Micah. I really just used Digg because they provided enough data through their API to do something interesting. While I was writing this piece, however, I learned that Digg actually uses Lucene for their search engine.
July 12th, 2007 at 1:21 pm
The CopyField command is interesting; I’m curious about what that does to the ability of Lucene to use field boosting for ranking purposes, though. Could you still give documents with the search term(s) in, say, the title field a boost in terms of ranking? Or does using the CopyField command to combine all the fields mean the search ranking will be based on word count only?
July 12th, 2007 at 1:57 pm
Unfortunately, I think using CopyField forces you to weight all text equally. You could try implementing your own Query parser, or simply rewriting queries so that they search both fields (e.g., rewrite “query” as “title:query OR query”). In my experience, the results are pretty good without explicitly weighting fields.
I have a book on Lucene that would probably provide a lot of useful information and might give a simple solution, but it’s on a truck headed out to San Francisco right now. Bad timing.
August 1st, 2007 at 10:08 am
If anyone here is a PHP coder, you should check out Zend Search Lucene. I got it up and running in like 15 minutes. At the time when I fiddled around with it, the documentation was sparse, but I’m guessing that’s changed a little since the Zend Framework has gone 1.0.1 now.
March 14th, 2008 at 1:14 am
I am a novice programmer…Can U kindly guide me, how to query Lucene created indexes with solr, but to be called within a java class?
Thank you
April 2nd, 2008 at 10:25 pm
Unfortunately, I am finding the solution indexing database with Lucene and zend framework.
Please, give me your idea for this, I love you so much.
Thanks