It’s rather ironic that, while search is nearly ubiquitous on the web, there is no perfect solution for adding search functionality to a web application. Many developers simply use the basic search functionality built into whatever database server they’re using. Until recently, systems that required a more feature-rich, efficient, or flexible search solution had to turn to proprietary commercial software. But this is no longer the case. Apache’s Lucene project has brought the open source community a sophisticated and flexible search solution that rivals most commercial packages.

Apache Lucene is a high-performance, feature-rich text search engine written in Java. It’s a cross-platform technology that is suitable for nearly any application that requires full-text search. If Java isn’t your language of choice, no worries: a sub-project called Solr wraps Lucene in a lightweight web service layer, making it easy to use from any language.

Getting started with Solr

In the remainder of this post I’ll walk through setting up a Lucene search solution using Solr. Getting Solr up and running on your system is easy. First, make sure you have Java installed, then follow these simple steps:

  1. Grab the latest release of Solr from an Apache mirror site (I’m using version 1.2)
  2. Extract the archive (unzip apache-solr-1.2.0.zip or tar -zxvf apache-solr-1.2.0.tgz)
  3. Launch Jetty with the Solr WAR and the example configs: cd apache-solr-1.2.0/example; java -jar start.jar

That’s it. Solr is now up and running on your system, listening on port 8983. If you visit http://localhost:8983/solr/admin you should see the Solr admin panel.

A real world example

To demonstrate how Lucene works, we’ll use the Digg API to index the last 50,000 or so stories submitted to digg.com. Start by creating a copy of the example directory called digg (cp -r example digg), then change to the newly created directory.

Like most Java applications, Solr stores its configuration in XML files. Most of the default options are fine, but we’ll need to change the example schema so that Lucene can properly index data from Digg. The schema.xml file is located in the ./solr/conf/ subdirectory, and specifies the fields our documents contain.

The <fields> section is where you list individual <field> elements. Each <field> has a name that you use to reference it when adding documents or executing searches, and a type that identifies the field’s data type. Data types are declared using <fieldType> elements. The standard types in the example configuration (text, boolean, sint, slong, sfloat, sdouble, among others) are sufficient for most applications.

To index Digg submissions, we need to store a number of fields, including the article description, title, link, topic, user, and status. For each field you can specify whether it should be indexed (so that it can be used in searches) and/or stored (so that you can retrieve its value directly from Lucene). If your documents are stored in a database, you may choose to have Lucene index some fields, but not store them. As long as Lucene stores each document’s primary key, the document can be reconstructed directly from the database. This reduces Lucene’s storage requirements, and may help prevent some data anomalies.

I’ve added the following fields to index and store documents from Digg:

<field name="id" type="string" indexed="true" stored="true"
       required="true"/>
<field name="href" type="text" indexed="true" stored="true"
       required="true"/>
<field name="link" type="text" indexed="true" stored="true"
       required="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true"
       stored="true"/>
<field name="status" type="text" indexed="true" stored="true"/>
<field name="user" type="text" indexed="true" stored="true"/>
<field name="topic" type="text" indexed="true" stored="true"/>
<field name="container" type="text" indexed="true" stored="true"/>
<field name="diggs" type="sint" indexed="true" stored="true"
       default="0"/>
<field name="comments" type="sint" indexed="true" stored="true"
       default="0"/>

Queries sent to Lucene will, by default, search a single field. This field is specified using the <defaultSearchField> element. Since we want to search multiple fields (description, title, user, etc.), we’ll set the default search field to a catchall text field called text, then copy each field we want searched into it using the <copyField> element.

Any number of <copyField> declarations can be included in a schema. They instruct Solr to copy data from the field specified by the source attribute to the field specified in the dest attribute. We’ll copy each document’s title, description, user, topic, and container into our catchall ‘text’ field.

<copyField source="title" dest="text"/>
<copyField source="description" dest="text"/>
<copyField source="user" dest="text"/>
<copyField source="topic" dest="text"/>
<copyField source="container" dest="text"/>

That’s pretty much it for the schema. You can download my schema.xml file here if you want to play along.
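
Before restarting Solr, it can help to sanity-check the schema, since a misspelled attribute (index instead of indexed, say) or a copyField that points at an undeclared field is easy to miss by eye. Here is a minimal sketch in Python. The check_schema helper and the KNOWN_FIELD_ATTRS list are my own illustration, not part of Solr, and the check is deliberately simplified: it ignores dynamic fields and wildcard copyField sources, which real schemas may use.

```python
import xml.etree.ElementTree as ET

# Attributes this sketch accepts on <field>. The list is an assumption for
# illustration and is not exhaustive; extend it to match your Solr version.
KNOWN_FIELD_ATTRS = {
    "name", "type", "indexed", "stored", "required", "default", "multiValued",
}


def check_schema(schema_xml):
    """Return a list of human-readable problems found in a schema string."""
    root = ET.fromstring(schema_xml)
    problems = []
    declared = set()

    # Collect declared field names and flag unrecognized attributes.
    for field in root.iter("field"):
        declared.add(field.get("name"))
        for attr in field.attrib:
            if attr not in KNOWN_FIELD_ATTRS:
                problems.append(
                    "field %r: unknown attribute %r" % (field.get("name"), attr)
                )

    # Every copyField should reference declared fields on both ends.
    for copy in root.iter("copyField"):
        for attr in ("source", "dest"):
            if copy.get(attr) not in declared:
                problems.append(
                    "copyField: %s %r is not a declared field"
                    % (attr, copy.get(attr))
                )
    return problems


# A deliberately broken sample schema (hypothetical, trimmed for brevity).
SAMPLE = """
<schema name="digg" version="1.1">
  <fields>
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="user" type="text" index="true" stored="true"/>
    <field name="text" type="text" indexed="true" stored="false"
           multiValued="true"/>
  </fields>
  <copyField source="title" dest="text"/>
  <copyField source="topic" dest="text"/>
</schema>
"""

if __name__ == "__main__":
    for problem in check_schema(SAMPLE):
        print(problem)
```

To check a real file, read ./solr/conf/schema.xml into a string and pass it to check_schema.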

Solr is now configured and ready to index documents. Again, to start Solr, run java -jar start.jar from the digg directory we created earlier. As long as there are no mistakes, Jetty should launch the Solr WAR with our new configuration.

Getting documents into Solr

Our Solr server is up and running, but it doesn’t contain any data. You interact with Solr programmatically using a RESTful web service API. You can modify a Solr index by POSTing XML documents containing instructions to add or delete documents, commit pending adds and deletes, and optimize your index. An example command looks like this:

<add>
  <doc>
    <field name="id">2321564</field>
    <field name="link">http://immike.net/...</field>
    <field name="href">http://digg.com/...</field>
    <field name="status">upcoming</field>
    <field name="title">Article Title</field>
    <field name="description">Article Description...</field>
    <field name="user">mmalone</field>
    <field name="topic">Offbeat News</field>
    <field name="container">World &amp; Business</field>
    <field name="diggs">2</field>
  </doc>
</add>
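
If you’d rather not hand-write that XML, here is a rough sketch of building and POSTing it from Python using only the standard library. The build_add_xml and post_xml helpers are hypothetical names of my own, and the /solr/update URL is an assumption based on the default example setup, so adjust it if your install differs. Building the document with an XML library also takes care of escaping characters like & in field values.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed update URL for the default example Jetty setup on port 8983.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_xml(docs):
    """Serialize a list of dicts (field name -> value) into an <add> command."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # special characters are escaped on output
    return ET.tostring(add, encoding="utf-8")


def post_xml(body, url=UPDATE_URL):
    """POST an XML command (bytes) to Solr and return the response text."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "text/xml; charset=utf-8"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    docs = [{
        "id": 2321564,
        "title": "Article Title",
        "container": "World & Business",
        "diggs": 2,
    }]
    body = build_add_xml(docs)
    print(body.decode("utf-8"))
    # With Solr running, send the documents and then commit them:
    # post_xml(body)
    # post_xml(b"<commit/>")
```

Added documents don’t show up in search results until a commit is sent, which is why the sketch follows the add with a separate commit command.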

There are a number of libraries that allow you to easily interact with Solr from various programming languages. The SolrUpdate PHP client lets you add an array of documents (represented as associative arrays with field names as keys) to a Solr index and commit them in a few lines of code. For an example of how SolrUpdate works, check out this script for indexing Digg submissions.

Querying your Solr index

Searches are performed using HTTP GET requests at the /solr/select URL of your Solr webserver. The query is passed in the q parameter. A number of request parameters can be used to control what information is returned. A typical query might look something like this: http://localhost:8983/solr/select?q=iphone.
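
As a small illustration, here is how you might build such a query URL in Python so that the query string is properly encoded. The build_query_url helper is a hypothetical name of my own. The start and rows parameters control paging, and fl limits which stored fields come back; I’m assuming the default host and port from the example setup.

```python
from urllib.parse import urlencode

# Assumed select URL for the default example setup.
SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(query, start=0, rows=10, fields=None):
    """Build a Solr select URL with an encoded query string."""
    params = [("q", query), ("start", start), ("rows", rows)]
    if fields:
        # fl takes a comma-separated list of field names to return.
        params.append(("fl", ",".join(fields)))
    return SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query_url("iphone"))
    print(build_query_url("apple iphone", start=10, fields=["title", "href"]))
```

Fetching the resulting URL with any HTTP client returns the XML result document.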

Solr returns XML search results, so you can manipulate them however you like. You can even query Solr directly using an XMLHttpRequest, and manipulate the results using JavaScript. A typical query result looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">1</int>
 <lst name="params">
  <str name="indent">on</str>
  <str name="rows">10</str>
  <str name="start">0</str>
  <str name="q">iphone</str>
  <str name="version">2.2</str>
 </lst>
</lst>
<result name="response" numFound="2891" start="0">
 <doc>
  <int name="comments">0</int>
  <str name="container">Technology</str>
  <str name="description">Video of iphone unboxing.</str>
  <int name="diggs">2</int>
  <str name="href">http://digg.com/apple/Video_of_iphone_unboxing</str>
  <str name="id">2377017</str>
  <str name="link">http://www.iphonematters.com/article/video_of_iphone_unboxing_from_trunk_of_car/</str>
  <str name="status">upcoming</str>
  <date name="timestamp">2007-06-30T22:30:39.408Z</date>
  <str name="title">Video of iphone unboxing</str>
  <str name="topic">Apple</str>
  <str name="user">titlesaysitall</str>
 </doc>
...
</result>
</response>
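
To give a sense of how little code it takes to consume a response like this, here is a minimal Python sketch that pulls out the total hit count and each document’s fields. The parse_results helper is a hypothetical name of my own, and the sample below is a trimmed-down version of the response above. The sketch only converts int values and leaves everything else (including dates) as strings.

```python
import xml.etree.ElementTree as ET


def parse_results(xml_text):
    """Return (numFound, list of dicts) from a Solr XML response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc_el in result.findall("doc"):
        doc = {}
        for child in doc_el:
            value = child.text or ""
            if child.tag == "int":
                value = int(value)  # other types are kept as plain strings
            doc[child.get("name")] = value
        docs.append(doc)
    return int(result.get("numFound")), docs


# Trimmed sample response for illustration.
SAMPLE = """
<response>
<result name="response" numFound="2891" start="0">
 <doc>
  <int name="diggs">2</int>
  <str name="id">2377017</str>
  <str name="title">Video of iphone unboxing</str>
  <str name="topic">Apple</str>
 </doc>
</result>
</response>
"""

if __name__ == "__main__":
    total, docs = parse_results(SAMPLE)
    print(total, "matches")
    for doc in docs:
        print(doc["id"], doc["title"])
```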

More Information

Lucene and Solr are highly customizable, and I’ve only touched on the most basic functionality of each application. Additional information can be found on the Solr website and the Lucene website. In particular, you should read the Solr tutorial, which walks you through setting up the example search index distributed with Solr. There is also a Solr Wiki and a Lucene Wiki, both of which are great resources if you’re interested in learning more about what these applications have to offer.