AltLaw.org began life as a Ruby on Rails project. I was enamored of Rails at the time (who wasn’t?) and rapid-prototyping helped me get it up and running in just a few months.
The problem is that I’ve been trying to follow Rails’ conventions even when they aren’t exactly in line with what I need to do. Right now, AltLaw is a very standard Rails setup: Apache2, Mongrel, MySQL, and Ferret for searching. That works perfectly for every other site I run, but they each have a few hundred pages, not AltLaw’s hundreds of thousands. Ferret is a great search-engine library, but it’s still a work in progress. ActiveRecord is a great framework, but it’s slow and cumbersome when you’re trying to do batch operations on thousands of records.
To put it simply: dealing with this much data requires more than convention.
So as we look toward a non-beta 1.0 release some time in 2008, I’m trying to figure out what the site architecture should be. Rails can still provide the web interface. Solr looks like a good stable search solution, with the added benefit that we can offer the search API to other developers.
The part with no obvious solution is the data store. We have thousands of documents, from more than a dozen different sources, in half a dozen different formats (PDF, HTML, text, even WordPerfect). Different documents have different kinds of metadata, and we need to be able to store, index, and retrieve all of that in a consistent way.
I’ve tried putting everything in a relational database, but databases weren’t really designed to hold large amounts of text. I’ve tried putting everything into structured XML documents, but it’s really hard to get the schema right. I’ve also used a combination of a relational database and static files, which works but can get complicated.
So if anybody knows of open-source software designed for this type of thing, or books I should be reading about information architecture, I’m all ears.
You’ve basically described Lucene, the library of which Ferret is a port. I don’t know of any other library for searching structured text, though I think PyLucene is more mature than Ferret. However, the last time I seriously looked into this was around 2005.
We’re using Solr at 37signals to provide search for all our applications dealing with gigabytes of data. The data is based on web documents and text, though, not PDFs and so on. It works really well.
Sorry, I didn’t read this entry closely enough. I hadn’t heard of Solr but that sounds like the best choice for indexing. For storage I’d keep the data in whatever format it fits in naturally (probably flat files, but maybe database tables) and keep all the standardized metadata in the Solr index.
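The flat-files-plus-Solr split suggested above can be as simple as deriving a stable id from each file's path and indexing only the standardized metadata alongside a pointer back to the file. A hypothetical sketch (the field names and helper are made up, not anyone's actual code):

```ruby
require 'digest'

# Keep the original document on disk; put only standardized metadata into the
# Solr index, plus a pointer back to the file. Field names are hypothetical.
def solr_doc_for(path, metadata)
  metadata.merge(
    "id"   => Digest::SHA1.hexdigest(path), # stable id derived from the file path
    "path" => path                          # where to fetch the full text later
  )
end

# solr_doc_for("/data/scotus/410us113.html",
#              "title" => "Roe v. Wade", "court" => "SCOTUS")
```

The nice property of this split is that the files never need to change when the index schema does; you just re-run the metadata extraction and rebuild the index.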
Not sure if this will help either, but there you go.
Have you looked at Greenstone Digital Library system http://www.greenstone.org/ ?
Active, helpful developers and a friendly community; I recommend it.
Stephen
Try looking at CouchDB; it’s designed for semi-structured data like this.
I have similar issues with one of my core products.
I first started using Ferret too, but one morning the index files it builds started growing tremendously without rhyme or reason, maxing out disk space. I switched everything over to Solr that day and it’s been a solid performer for over a year now. Very fast when you know how to talk to the Solr server.
Early versions of the acts_as_solr plugin, at least, were naive implementations. I’ve branched my own version where I can delay commits to the Solr index in order to make large updates fast as well. For searching, I query the Solr server to return strictly ids, then query the db with those, so only a small amount of data gets passed around. Also, you have to make sure that documents deleted from the db also get deleted from the Solr index.
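The ids-only pattern described above might look something like this. This is a sketch, not the commenter's actual branch: the Solr URL, the `id` field name, and the `Case` model are all assumptions.

```ruby
require 'json'
require 'uri'

# Build a Solr query URL that asks for nothing but document ids (fl=id).
# The URL and field name are assumptions about your Solr schema.
def solr_id_query_url(query, solr_url = "http://localhost:8983/solr/select")
  "#{solr_url}?#{URI.encode_www_form(q: query, fl: "id", wt: "json")}"
end

# Pull the ids out of a Solr JSON response body.
def extract_ids(json_body)
  JSON.parse(json_body)["response"]["docs"].map { |doc| doc["id"] }
end

# ids   = extract_ids(Net::HTTP.get(URI(solr_id_query_url("habeas corpus"))))
# cases = Case.find(ids)  # hydrate full records from the database, not from Solr
```

The point of the design is that Solr only ships a list of short ids over the wire; the heavyweight document bodies stay in the database (or on disk) and are fetched in one query.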
For batch updates/uploads to the db, check out http://agilewebdevelopment.com/plugins/ar_extensions . It has a batch import function that you can also use to update records with unique keys. I saw huge performance gains with this plugin.
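The performance win from plugins like ar_extensions comes largely from collapsing thousands of single-row INSERTs into one multi-row statement. A minimal sketch of that core trick (table and column names are hypothetical, and the quoting here is deliberately naive — in real code, let the plugin or your database adapter handle escaping):

```ruby
# Build one multi-row INSERT instead of issuing N separate statements.
# Naive quoting for illustration only; do not use against untrusted input.
def multi_row_insert(table, columns, rows)
  tuples = rows.map do |row|
    "(" + row.map { |v| "'#{v.to_s.gsub("'", "''")}'" }.join(", ") + ")"
  end
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{tuples.join(', ')}"
end

# multi_row_insert("cases", %w[docket title],
#                  [["04-1152", "Kelo"], ["05-380", "Gonzales"]])
```

One round-trip per batch instead of one per record is where the "huge performance gains" come from, especially over thousands of rows.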
Sorry I can’t help you with the best way to store all of the text in general. I keep newspaper articles just fine in MySQL. I’m sure there’s a way; LexisNexis does it.
Try CouchDB: http://couchdb.org
There has been quite a bit of buzz about CouchDB… it sounds right up your alley.
http://couchdb.com/CouchDB/CouchDBWeb.nsf/Home?OpenForm
For the data store, have you considered CouchDB? I’ve not used it, but I’m dying to. It’s pretty niche and new at this point, but maybe it’s worth a look for you.