Cuberick: Lucene: Boosting, Multi-Field Searches, and Multi-Index Searches.

3 Completely Obvious Lucene Tricks

Look out below! Here comes another braindump, and it's all about Lucene. There are three tricks to using Lucene that I am going to illustrate here. I don't think they really qualify as tricks since there is nothing tricky about them, but if you are just getting started with Lucene it is nice to know they are there.

The first trick is boosting fields and documents. Boosting simply means increasing the relevance of something in the search results, and there are several ways to use it in your application. If the application you are searching with uses Lucene's default QueryParser, you can tell Lucene to boost certain terms with a special syntax in the query string. Here is an example that would be good for searching an archive of essays both literary and smutty. If you wanted to search for documents on Moby Dick, while steering clear of results that were focused on the D-bomb, you might try something like this: "Moby^4 Dick". That weights the term moby more heavily than dick when results are scored, so with luck you won't have too many that are all about Dick.

That is a nice way to use boosting, but sometimes you want to use it as a programmer. Let's say, for example, that you want to boost documents written by Herman Melville in your search results. The following code shows how to do that by boosting a field when the document is built for indexing. Keep in mind that a boost set on a field only kicks in when the query actually matches that field.
[code lang="java"]
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// make a new, empty document
Document doc = new Document();
// Add the tag-stripped contents as an UnStored field so it
// will get tokenized and indexed, but not stored in the index.
doc.add(Field.UnStored("contents", parser.getContents()));
// Add the author as a separate Text field, so that it can be searched
// separately, and give it a boost if it is Herman Melville.
Field authorField = Field.Text("author", parser.getAuthor());
if ("Herman Melville".equals(parser.getAuthor())) {
    authorField.setBoost(2.0f);
}
// Always add the author field, boosted or not.
doc.add(authorField);
[/code]
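Going back to the query-string flavor of boosting for a moment: if you are assembling query strings in code before handing them to the query parser, the ^ syntax is easy to tack on yourself. Here is a minimal sketch in plain Java with no Lucene dependency. The class BoostSyntax and its boost method are hypothetical names made up for this illustration, not part of Lucene.

```java
class BoostSyntax {
    // Hypothetical helper: appends the query parser's ^boost suffix to a single term.
    static String boost(String term, int factor) {
        return term + "^" + factor;
    }

    public static void main(String[] args) {
        // Boost "Moby" by a factor of 4 and leave "Dick" at the default weight.
        String query = boost("Moby", 4) + " Dick";
        System.out.println(query); // prints: Moby^4 Dick
    }
}
```

The resulting string is what you would pass to the query parser in place of the raw user input.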
It is also possible to boost an entire document, but I've never done it so I can't really comment on that here.

For my next completely obvious trick I am going to talk about running queries over multiple fields in a document. This one is a no-brainer, and a quick survey of the API is all it takes to figure it out. Still, I insist on telling you all about it. OK, maybe I'll skip the witty didacticisms and just point you to the class that makes it possible: MultiFieldQueryParser. That should make it pretty clear. Here is a code snippet that uses it, just for giggles.
[code lang="java"]
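import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

Analyzer analyzer = new StandardAnalyzer();
// Open one searcher per index, then wrap them in a MultiSearcher.
IndexSearcher smutSearch = new IndexSearcher("/web/files/indexes/literature/smut");
IndexSearcher litSearch = new IndexSearcher("/web/files/indexes/literature/high_falootin");
Searcher[] allSearches = { smutSearch, litSearch };
MultiSearcher searcher = new MultiSearcher(allSearches);

// Filter by the year (year is a String holding the year to match).
Query filter = QueryParser.parse(year, "year", analyzer);
Filter yearFilter = new QueryFilter(filter);
// Run the user's query string against both the contents and author fields.
String[] fields = { "contents", "author" };
Query query = MultiFieldQueryParser.parse(queryString, fields, analyzer);
Hits hits = searcher.search(query, yearFilter);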
[/code]
So you can see I am running the query string against both the contents and author fields of all documents in those indexes. Which brings me to my third obvious trick: searching across multiple indexes. Again there is a class in the API that will show you the way, MultiSearcher, and you can look at my code example above to see how I used it. I guess that wasn't so bad after all. Three obvious tricks and it only took me 2,000 words to explain them. What a maroon.
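For the curious, here is roughly what the multi-field trick amounts to. As I understand it, parsing one query string against several fields behaves about like repeating the same query under each field name. The sketch below spells that out as a query string in plain Java, with no Lucene dependency. The class MultiFieldExpansion and its expand method are hypothetical names made up for this illustration, and the field:(...) grouping it emits is my assumption about what your version of the query parser accepts.

```java
class MultiFieldExpansion {
    // Hypothetical helper: repeats the same query text under each field name,
    // using field:(...) grouping, with the clauses separated by spaces.
    static String expand(String queryText, String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(fields[i]).append(":(").append(queryText).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] fields = { "contents", "author" };
        System.out.println(expand("white whale", fields));
        // prints: contents:(white whale) author:(white whale)
    }
}
```

This is only meant to show the shape of the expanded query. In real code you would let MultiFieldQueryParser do the work, as in the snippet above.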

Thursday, January 12, 2006

1 comment:

Good Stuff
