
The name Lucene goes back to a song by Little Richard called "Lucille". The subject of the song is a man's search for his lost lover. Doug Cutting heard this song one morning while working on an early version of his search API and decided to name it for the lost lover. However, Cutting misunderstood Little Richard's lyrics and called the API Lucene instead of Lucille. Here are the lyrics to that song as Cutting must have understood them on that morning.
I woke up this morning
Lucene was not in sight.
I asked her friends about her
but all their lips were tight.
Lucene
please come back where you belong.
I been good to you baby
please don't leave me alone.
It is unlikely that Lucene will help you to win back your baby, but if you had stored his or her name and phone number somewhere on your filesystem you could use Lucene to track those digits down, and then you could index and search the complete works of Shakespeare for a really winning passage to read to him or her. Then you might get your baby back and you could credit Lucene, but don't bet on it. Lucky for me my baby isn't lost, but I have found other uses for Lucene.

At work I have replaced an aging proprietary search engine with an indexing process that runs from cron. I then use the Lucene API from my servlets to query the indexes to find and rank the documents. Lucene has proved to be impressively fast, scalable, and easy to use. My goal here is just to document some of the things that I have done, not to write a complete introduction. I am a big fan of looking through code examples that others have published in order to get ideas, and when learning a new API I find reading through well-documented code more useful than a step-by-step tutorial. At the very least I will supply some code samples for parsing HTML and PDF files. I also have information about running your Lucene indexes out of cron and at the command line using Perl. If you have no experience with Lucene and are planning on implementing something, I suggest you follow the documentation on the Apache website; they have excellent tutorials. You should also be aware that the Lucene download comes bundled with some very useful examples. In fact, I have based my own indexer on the "org.apache.lucene.demo.IndexHTML" class that is provided as a demo for using Lucene. You will also want to check out the Lucene API.
The demo code that comes with Lucene should make it clear how you would parse plain text files, but their implementation of parsing HTML files is really lousy. I say it is lousy because they implement their own HTML parser. They probably did that because they didn't want the demo code to depend on anything but the Lucene jar, but in practice you will not want to write your own HTML parser. I suggest you look at NekoHTML. That is what I use in the following code snippet.
[code lang="java"]
public HtmlParser(File file) throws IOException, SAXException, Exception {
    // summary, contents and title are StringBuffer fields on this class
    summary = new StringBuffer();
    contents = new StringBuffer();
    title = new StringBuffer();

    // parse the html (throws an exception if we don't get html)
    DOMFragmentParser parser = new DOMFragmentParser();
    DocumentFragment fragment = new HTMLDocumentImpl()
            .createDocumentFragment();
    log.debug("start parsing: " + file.getName());
    parser.parse(new InputSource(new FileInputStream(file)), fragment);
    log.debug("finished parsing: " + file.getName());

    // get a string version of the text and title
    getText(contents, fragment);
    getTitle(title, fragment);

    // don't let a document with no text get in
    if (contents.length() == 0) {
        throw new Exception("The document is empty. FileName: "
                + file.getPath());
    }

    // if there is no text in the title, use the filename
    if (title.length() == 0) {
        title = new StringBuffer(file.getName());
    }
}
[/code]
One gotcha with using the NekoHTML jar is that if you use a JDK older than 1.5 you will need to add a few extra jars. The rt.jar that comes with JDK 1.5 includes lots of the org.w3c.dom... classes. I used Jarhoo to track down which jars I needed to make the thing work on an old 1.4.2 JDK. Here is a quick list to save you some time (there might be a way to get the classes you need with a smaller set of jars, but this works): xercesImpl.jar, xmlParserAPIs.jar, xmlbeans.jar. All of those jars are available from the ibiblio Maven repository.
As for the code snippet, there is nothing interesting until you get down to the comment that reads "// parse the html". There I use the Neko DOMFragmentParser (because I am not expecting well-formed HTML) to parse the HTML file, which stores the HTML in a DOM. At that point we can use a simple recursive method to traverse the DOM tree and retrieve whatever it is we want. You can see from my code that I call the methods getTitle() and getText(). Both of these are simple recursive methods. Let's look at them.
[code lang="java"]
/**
 * get all text from the HTML document
 *
 * @param sb   buffer that collects the text
 * @param node current node in the DOM tree
 */
private void getText(StringBuffer sb, final Node node) {
    // base case is when we are at a node containing text,
    // not children
    if (node.getNodeType() == Node.TEXT_NODE) {
        sb.append(node.getNodeValue());
    }
    NodeList children = node.getChildNodes();
    if (children != null) {
        int len = children.getLength();
        for (int i = 0; i < len; i++) {
            // recursively call for the next node
            getText(sb, children.item(i));
        }
    }
}

/**
 * get the title from the HTML document
 *
 * @param sb   buffer that collects the title text
 * @param node current node in the DOM tree
 */
private void getTitle(StringBuffer sb, final Node node) {
    // base case is when we are at an element node
    if (node.getNodeType() == Node.ELEMENT_NODE) {
        // if the node is a "title" node we get its text
        // (guard against an empty <title></title>)
        if ("title".equalsIgnoreCase(node.getNodeName())
                && node.getFirstChild() != null) {
            sb.append(node.getFirstChild().getNodeValue());
        }
    }
    NodeList children = node.getChildNodes();
    if (children != null) {
        int len = children.getLength();
        for (int i = 0; i < len; i++) {
            // recursively call for the next node
            getTitle(sb, children.item(i));
        }
    }
}
[/code]
The getText() method is only concerned with grabbing all of the text contained within the document. It traverses the DOM tree looking for text nodes and appending them to a string buffer. getTitle() is different in that we are looking for a specific element of an HTML document, so we traverse the tree looking only at element nodes. When we find the title element node we grab its child text node, which holds the title text. It looks simple, but I spent some time figuring this out as I had (and still have) very little real knowledge of the DOM. Of course, if you are still with me at this point you must already know that Lucene can only index text. I mean, why else would I bother parsing HTML? Well, I also find it useful to parse PDF files sometimes. The Java open source community again comes through with a library called PDFBox. Here is a similar example code snippet for parsing PDFs.
[code lang="java"]
public PdfParser(File file) throws IOException, Exception {
    FileInputStream in = new FileInputStream(file);
    filename = file.getName();
    contents = new StringBuffer();

    // READ IN the file
    COSDocument cosDoc = null;
    PDDocument pdDoc = null;
    try {
        PDFParser parser = new PDFParser(in);
        parser.parse();
        cosDoc = parser.getDocument();
    } catch (IOException e) {
        if (cosDoc != null) {
            cosDoc.close();
        }
        throw new Exception("Cannot parse PDF document", e);
    }

    // PARSE the entire text out for the contents
    String docText = null;
    try {
        pdDoc = new PDDocument(cosDoc);
        PDFTextStripper stripper = new PDFTextStripper();
        docText = stripper.getText(pdDoc);
    } catch (IOException e) {
        cosDoc.close();
        if (pdDoc != null) {
            pdDoc.close();
        }
        throw new Exception("Cannot parse PDF document", e);
    }
    if (docText != null) {
        contents = new StringBuffer(docText);
    }

    // EXTRACT the PDF document's meta-data
    try {
        PDDocumentInformation docInfo = pdDoc.getDocumentInformation();
        // author and keywords are here if you want to index them too
        String author = docInfo.getAuthor();
        title = docInfo.getTitle();
        String keywords = docInfo.getKeywords();
        summary = docInfo.getSubject();
    } catch (Exception e) {
        String msg = e.getMessage();
        if (msg != null && msg.indexOf("value cannot be null") != -1) {
            log.info("Meta data was null for " + file.getPath());
            title = file.getName();
        } else {
            log.error("Cannot get PDF document meta-data: " + msg);
            System.err.println("Cannot get PDF document meta-data: " + msg);
        }
    }

    // close both documents or PDFBox will leave tmp files behind
    cosDoc.close();
    pdDoc.close();

    // don't let a document with no text get in
    if (contents.length() == 0) {
        throw new Exception("The document is empty. FileName: "
                + file.getPath());
    }
}
[/code]
Going step by step through the code: we create a COSDocument object, grab all of the text with an instance of PDFTextStripper, and then grab whatever PDF metadata interests us with an instance of PDDocumentInformation. That is a bit easier than having to traverse a DOM, but you want to make sure to close your COS and PDD document objects, because if you don't you will leave behind a lot of tmp files in your /tmp directory.
So now that we can convert HTML and PDF to text, we can index these documents. Indexing is very simple in Lucene, and I can't really add anything to what you would learn on the Apache website. Another good account of basic Lucene topics is found here. I am now going to turn briefly to searching the indexes. As with indexing there is plenty of documentation, but I am going to give you a peek at a snippet that uses filtered search results.
[code lang="java"]
// Setup the analyzer
Analyzer analyzer = new StandardAnalyzer();
// Setup the searcher given the directory that contains the index.
Searcher searcher = new IndexSearcher("/web/files/shakespeare/index");
// we want to filter results by the "type" field
Query filter = QueryParser.parse("sonnet", "type", analyzer);
Filter typeFilter = new QueryFilter(filter);
// run the query with the filter we created
Query query = QueryParser.parse("lucille, oh how i love you", "contents", analyzer);
Hits hits = searcher.search(query, typeFilter);
// Get the first result (highest ranked)
Document doc = hits.doc(0);
// Return the title
return doc.get("title");
[/code]
In this example you can see I run a query against our index, which was stored at "/web/files/shakespeare/index". The documents I indexed had a "type" field that was of type Field.Text, which means I can both index and retrieve the information in that field. In my example I use that mechanism to limit search results to sonnets. Below is a rough sketch of how a document with those fields gets into the index in the first place.
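I skipped over the indexing code above, so here is a minimal sketch of what that step might look like using the same old 1.x-style API as the search snippet. The field names ("title", "contents", "type") match the search example; the index path, analyzer choice, and the sample values are just placeholders for illustration, not my real indexer.
[code lang="java"]
// Minimal indexing sketch (same 1.x-era Lucene API as the search snippet above).
// Field names match the search example; path, analyzer, and values are placeholders.
IndexWriter writer = new IndexWriter("/web/files/shakespeare/index",
        new StandardAnalyzer(), true); // true = create a fresh index
Document doc = new Document();
doc.add(Field.Text("title", "Sonnet 18"));       // indexed, tokenized, and stored
doc.add(Field.Text("contents", "Shall I compare thee to a summer's day? ..."));
doc.add(Field.Text("type", "sonnet"));           // used by the QueryFilter at search time
writer.addDocument(doc);
writer.optimize();
writer.close();
[/code]
With the parsers above, the "title" and "contents" values would simply be the toString() of the StringBuffers built by HtmlParser or PdfParser.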
Phew. This article is getting too long already. I know I really flew through all that, but I feel I have covered the hardest and least documented parts of real world Lucene usage. I'll have to come back and write a continuation of this article tomorrow. It will cover running Lucene Indexes as a cron job, avoiding the use of Plucene or some other Lucene port (because they are slow), and any other odds and ends I can think up. It will be one more sleepless night for all of you who can't get your baby back without Lucene.
2 comments:
Thanks for finally revealing the true Lucene story!
Also, Neko did not exist when Lucene's HTML demo was built. Please feel free to contribute an improved HTML demo by sending it to the Lucene developer list.
Yea, uh, sorry about that "lousy" bit. I was perhaps a bit carried away. One should not mock what one cannot understand. After 5 minutes trying to follow the HTML parsing class I completely blacked out and got a nasty bruise. I was just bitter about how that class affected my good looks. Thanks Doug.