Posts Tagged “Solr”

Since I started working on AltLaw.org a little over a year ago, I’ve struggled with the data model. I’ve got around half a million documents, totally unstructured, from a dozen different sources. Each source has its own quirks and problems that have to be corrected before it’s useful.

When I started, I was using MySQL + the filesystem + Ferret. Keeping all that in sync, even with Rails’ ActiveRecord, was tricky, and searches were slow. Now I’ve got everything stored directly in a Solr index, which is very fast for searching. But it’s not so great when I want to change, say, how I convert the source PDFs into HTML, and I have to re-index half a million documents. And it can’t store structured data very well, so I augment it with serialized hash tables and SQLite.

What I want is a storage engine that combines properties of a relational database and a full-text search engine. It would put structured fields into SQL-like tables and unstructured fields into an inverted index for text searches. I want to be able to do queries that combine structured and unstructured data, like “cases decided in the Second Circuit between 1990 and 2000 that cite Sony v. Universal and contain the phrase ‘fair use’.” I also want to store metadata about my metadata, like where it came from, how accurate I think it is, etc.
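
To make that kind of query concrete, here is a toy sketch in Python of what I mean, not a real storage engine. Everything in it is made up for illustration: the `Case` and `ToyStore` names, the field names, and the four sample records. Structured fields live in plain records, the text goes into a small positional inverted index, and a search intersects the two.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical structured fields; a real schema would have many more.
    doc_id: str
    court: str
    year: int
    cites: set
    text: str


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyStore:
    """Structured fields in a dict, text in a positional inverted index."""

    def __init__(self):
        self.cases = {}
        # term -> doc_id -> list of positions where the term occurs
        self.index = defaultdict(lambda: defaultdict(list))

    def add(self, case):
        self.cases[case.doc_id] = case
        for pos, term in enumerate(tokenize(case.text)):
            self.index[term][case.doc_id].append(pos)

    def phrase(self, phrase):
        """Return ids of documents containing the terms at consecutive positions."""
        terms = tokenize(phrase)
        if not terms:
            return set()
        hits = set()
        for doc_id, starts in self.index.get(terms[0], {}).items():
            for start in starts:
                if all(start + i in self.index.get(t, {}).get(doc_id, [])
                       for i, t in enumerate(terms)):
                    hits.add(doc_id)
                    break
        return hits

    def search(self, phrase, court, years, cites):
        """Combine a phrase match with filters on the structured fields."""
        first, last = years
        return sorted(
            doc_id for doc_id in self.phrase(phrase)
            if self.cases[doc_id].court == court
            and first <= self.cases[doc_id].year <= last
            and cites in self.cases[doc_id].cites
        )


# Made-up sample records, only to exercise the query.
store = ToyStore()
store.add(Case("a", "2d Cir.", 1994, {"Sony v. Universal"}, "The court held the copying was fair use."))
store.add(Case("b", "2d Cir.", 2005, {"Sony v. Universal"}, "Fair use was again at issue."))
store.add(Case("c", "9th Cir.", 1995, {"Sony v. Universal"}, "A fair use defense failed."))
store.add(Case("d", "2d Cir.", 1996, {"Sony v. Universal"}, "Use of the mark was fair."))

results = store.search("fair use", "2d Cir.", (1990, 2000), "Sony v. Universal")
print(results)  # ['a']
```

Only record "a" passes every condition: "b" falls outside the date range, "c" is from the wrong court, and "d" contains both words but not the phrase.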

I’ve been exploring some interesting alternatives: CouchDB is probably the closest to what I want, but it’s still under heavy development. DSpace is oriented more towards archival than search. Then there’s the big RDF monster, maybe the most flexible data model ever, which makes it correspondingly difficult to grasp.

I’ve come to the conclusion that there is no perfect data model. Or, to put it another way, no storage engine is going to do my work for me. What I need to do is come up with a solid, flexible API that provides just the features I need, then put that API in front of one or more back-ends (Lucene, SQL, the filesystem) that handle whatever they’re best at.
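
A rough Python sketch of that shape, under heavy assumptions: the `Archive`, `TextIndex`, and `RecordStore` names are invented here, and the two in-memory classes are stand-ins for real back-ends such as Lucene or SQL. The point is only that the application talks to one small API, and each back-end sits behind an interface it can satisfy.

```python
from typing import Protocol


class TextIndex(Protocol):
    """What the API needs from a full-text back-end (hypothetical interface)."""
    def index(self, doc_id: str, text: str) -> None: ...
    def search(self, term: str) -> set: ...


class RecordStore(Protocol):
    """What the API needs from a structured back-end (hypothetical interface)."""
    def put(self, doc_id: str, fields: dict) -> None: ...
    def get(self, doc_id: str) -> dict: ...


class MemoryTextIndex:
    """In-memory stand-in for something like Lucene."""
    def __init__(self):
        self.postings = {}

    def index(self, doc_id, text):
        for word in text.lower().split():
            self.postings.setdefault(word, set()).add(doc_id)

    def search(self, term):
        return set(self.postings.get(term.lower(), set()))


class MemoryRecordStore:
    """In-memory stand-in for a SQL table."""
    def __init__(self):
        self.rows = {}

    def put(self, doc_id, fields):
        self.rows[doc_id] = dict(fields)

    def get(self, doc_id):
        return self.rows.get(doc_id, {})


class Archive:
    """The one API the application sees; each back-end does what it is best at."""
    def __init__(self, text_index: TextIndex, records: RecordStore):
        self.text_index = text_index
        self.records = records

    def add(self, doc_id, text, **fields):
        self.records.put(doc_id, fields)      # structured fields
        self.text_index.index(doc_id, text)   # unstructured text

    def find(self, term, **filters):
        """Text search first, then filter on structured fields."""
        return sorted(
            doc_id for doc_id in self.text_index.search(term)
            if all(self.records.get(doc_id).get(k) == v for k, v in filters.items())
        )


# Made-up documents, only to show the calls.
archive = Archive(MemoryTextIndex(), MemoryRecordStore())
archive.add("a", "fair use of a recording", court="2d Cir.", source="hypothetical-feed")
archive.add("b", "fair dealing", court="9th Cir.")
found = archive.find("fair", court="2d Cir.")
print(found)  # ['a']
```

Swapping a back-end then means writing another class with the same methods, without touching the code that calls `Archive`.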


I am happy to report that AltLaw.org’s switch to Solr has worked very well. Solr is a RESTful search engine, built on Lucene. The setup was more complicated than just using a search library, but the rewards were worth it.

Before, I was using Ferret, which I still like. It’s a great library, and Dave Balmain has done incredible work producing a fast search engine that integrates well with Ruby. I still use it on other sites. But Solr was a better fit for AltLaw.

With Ferret, I was trying to shoe-horn large, unstructured documents into a system (ActiveRecord, MySQL, and acts_as_ferret) that is better suited to small, structured records. Now I use Solr as both a search engine and a document store, eliminating MySQL from the picture. That, combined with Solr’s built-in caching, has dramatically decreased the server’s load average (from around 2.00 to under 0.30) while visibly improving search performance.
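
For a sense of what talking to Solr looks like, here is a small Python sketch that only builds a request URL and never contacts a server. The `q`, `fq`, `fl`, `rows`, and `wt` parameters are standard Solr query parameters; the host, port, and path, and the field names `text`, `court`, `year`, `id`, and `title`, are assumptions for illustration and are not AltLaw’s actual setup or schema.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed location of a local Solr instance; adjust for a real install.
SOLR_SELECT = "http://localhost:8983/solr/select"


def solr_select_url(phrase, filters, rows=10):
    """Build a select URL: a phrase query plus one fq parameter per filter.

    This sketch does not escape quotes or other special characters in `phrase`.
    """
    params = [("q", f'text:"{phrase}"')]
    params += [("fq", fq) for fq in filters]   # filter queries narrow the result set
    params += [
        ("fl", "id,title,score"),  # return stored fields, so Solr doubles as the document store
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    return f"{SOLR_SELECT}?{urlencode(params)}"


url = solr_select_url("fair use", ['court:"2d Cir."', "year:[1990 TO 2000]"])

# Decode the query string again to check what would be sent.
parsed = parse_qs(urlsplit(url).query)
print(parsed["q"], parsed["fq"])
```

Because the interface is plain HTTP, any language with an HTTP client can issue the same request.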

Also, I think it helps that Solr is not integrated with Ruby. The solr-ruby gem is not well documented, but it is easy to figure out, as it’s just a thin wrapper over the Solr APIs. Having the search engine in a separate process made it easier to separate the indexing and searching code from the web application. As a result, the Rails code shrank to one-fourth its former size.
