AltLaw.org began life as a Ruby on Rails project. I was enamored of Rails at the time (who wasn’t?), and rapid prototyping helped me get it up and running in just a few months.
The problem is that I’ve been trying to follow Rails’ conventions even when they aren’t exactly in line with what I need to do. Right now, AltLaw is a very standard Rails setup: Apache2, Mongrel, MySQL, and Ferret for searching. That works perfectly for every other site I run, but they each have a few hundred pages, not AltLaw’s hundreds of thousands. Ferret is a great search-engine library, but it’s still a work in progress. ActiveRecord is a great framework, but it’s slow and cumbersome when you’re trying to do batch operations on thousands of records.
To put it simply: dealing with this much data requires more than convention.
So as we look toward a non-beta 1.0 release sometime in 2008, I’m trying to figure out what the site architecture should be. Rails can still provide the web interface. Solr looks like a good, stable search solution, with the added benefit that we can offer the search API to other developers.
The part with no obvious solution is the data store. We have thousands of documents, from more than a dozen different sources, in half a dozen different formats (PDF, HTML, text, even WordPerfect). Different documents have different kinds of metadata, and we need to be able to store, index, and retrieve all of that in a consistent way.
I’ve tried putting everything in a relational database, but relational databases weren’t really designed to hold large amounts of text. I’ve tried putting everything into structured XML documents, but it’s really hard to get the schema right. I’ve also used a combination of a relational database and static files, which works but can get complicated.
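For concreteness, here is a minimal sketch of what that last combination might look like. It is written in Python with SQLite purely for illustration; it is not AltLaw’s actual code, and every name in it (`DocumentStore`, the `documents` table, the hash-based file layout, the JSON metadata column) is a made-up assumption. The idea is that the database holds only a small, uniform row per document, the varying metadata rides along as JSON, and the full text lives in a plain file on disk.

```python
import hashlib
import json
import sqlite3
import tempfile
from pathlib import Path


class DocumentStore:
    """Hypothetical hybrid store: metadata rows in SQLite, full text in static files."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(str(self.root / "index.db"))
        # One uniform row per document; source-specific metadata is kept as JSON
        # so that different sources do not each need their own columns.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS documents ("
            " id INTEGER PRIMARY KEY,"
            " source TEXT NOT NULL,"
            " format TEXT NOT NULL,"
            " path TEXT NOT NULL,"
            " metadata TEXT NOT NULL)"
        )

    def add(self, source, fmt, text, metadata):
        """Write the text to a file named by its hash, then record it in the database."""
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        # Spread files across subdirectories so no single directory grows huge.
        rel_path = Path(digest[:2]) / (digest + ".txt")
        full_path = self.root / rel_path
        full_path.parent.mkdir(parents=True, exist_ok=True)
        full_path.write_text(text, encoding="utf-8")
        with self.db:  # commits on success, rolls back on error
            cur = self.db.execute(
                "INSERT INTO documents (source, format, path, metadata)"
                " VALUES (?, ?, ?, ?)",
                (source, fmt, rel_path.as_posix(), json.dumps(metadata)),
            )
        return cur.lastrowid

    def get(self, doc_id):
        """Look up the row, then read the text back from the file it points to."""
        row = self.db.execute(
            "SELECT source, format, path, metadata FROM documents WHERE id = ?",
            (doc_id,),
        ).fetchone()
        if row is None:
            raise KeyError(doc_id)
        source, fmt, path, metadata = row
        return {
            "source": source,
            "format": fmt,
            "metadata": json.loads(metadata),
            "text": (self.root / path).read_text(encoding="utf-8"),
        }


if __name__ == "__main__":
    store = DocumentStore(tempfile.mkdtemp())
    doc_id = store.add("example-court", "pdf", "Full text of an opinion.",
                       {"docket": "00-0000"})
    print(store.get(doc_id)["metadata"])
```

Even in a toy like this you can see where the complication creeps in: the file write and the database insert are two separate steps, so a crash between them leaves an orphaned file, and anything that moves or deletes files has to update the database too.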
So if anybody knows of open-source software designed for this type of thing, or books I should be reading about information architecture, I’m all ears.
