Continuous Integration for Data

As I told a friend recently, I’m pretty happy with the front-end code of AltLaw.  It’s just a simple Ruby on Rails app that uses Solr for search and storage.  The code is small and easy to maintain.

What I’m not happy with is the back-end code, the data extraction, formatting, and indexing.  It’s a hodge-podge of Ruby, Perl, Clojure, C, shell scripts, SQL, XML, RDF, and text files that could make the most dedicated Unix hacker blanch.  It works, but just barely, and I panic every time I think about implementing a new feature.

This is a software engineering problem rather than a pure computer science problem, and I never pretended to be a software engineer.  (I never pretended to be a computer scientist, either.)  It might also be a problem better suited to a larger team than an army of one (plus a few volunteers).

But given that I can get more processing power (via Amazon) more easily than I can get more programmers, how can I make use of the resources I have to enhance my own productivity?

I’m studying Hadoop and Cascading in the hopes that they will help.  But those systems are inherently batch-oriented.  I’d like to move away from a batch processing model if I can.  Given that AltLaw acquires 50 to 100 new cases per day, adding to a growing database of over half a million, what I would really like to have is a kind of “continuous integration” process for data.  I want a server that runs continuously, accepting new data and new code and automatically re-running processes as needed to keep up with dependencies.  Perhaps given a year of free time I could invent this myself, but I’m too busy debugging my shell scripts.
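To make the idea concrete, here is a minimal sketch of what I mean by "keeping up with dependencies," written as a make-style polling loop in Ruby. The stage scripts (extract.rb, format.rb, index.rb) and the directory names are placeholders I made up for illustration, not AltLaw's actual pipeline; the point is only the shape: each stage re-runs a step when its input is newer than its output, so new cases (or changed code touching the inputs) propagate through automatically instead of waiting for a batch job.

```ruby
#!/usr/bin/env ruby
# Hypothetical sketch of a "continuous integration for data" loop.
# Each stage re-runs only when its output is missing or older than its input.
# Stage commands and directories are placeholders, not a real pipeline.

require 'fileutils'

STAGES = [
  { from: 'incoming',  to: 'extracted', cmd: %w[ruby extract.rb] },
  { from: 'extracted', to: 'formatted', cmd: %w[ruby format.rb]  },
  { from: 'formatted', to: 'indexed',   cmd: %w[ruby index.rb]   }
]

def stale?(input, output)
  # Re-run if the output does not exist or is older than its input.
  !File.exist?(output) || File.mtime(output) < File.mtime(input)
end

loop do
  STAGES.each do |stage|
    FileUtils.mkdir_p(stage[:to])
    Dir.glob(File.join(stage[:from], '*')).each do |input|
      output = File.join(stage[:to], File.basename(input))
      next unless stale?(input, output)
      # Each stage command is assumed to read one input file and
      # write one output file, passed as its two arguments.
      system(*stage[:cmd], input, output) or
        warn "#{stage[:cmd].join(' ')} failed on #{input}"
    end
  end
  sleep 60 # poll for newly arrived cases roughly once a minute
end
```

This is just the timestamp-driven core of a build tool turned into a long-running process; a real system would also need to notice when the *code* for a stage changes and invalidate everything downstream of it, which is the part I haven't solved.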