Continuous Integration for Data

As I told a friend recently, I’m pretty happy with the front-end code of AltLaw. It’s just a simple Ruby on Rails app that uses Solr for search and storage. The code is small and easy to maintain.

What I’m not happy with is the back-end code, the data extraction, formatting, and indexing. It’s a hodge-podge of Ruby, Perl, Clojure, C, shell scripts, SQL, XML, RDF, and text files that could make the most dedicated Unix hacker blanch. It works, but just barely, and I panic every time I think about implementing a new feature.

This is a software engineering problem rather than a pure computer science problem, and I never pretended to be a software engineer. (I never pretended to be a computer scientist, either.) It might also be a problem better suited to a larger team than to an army of one (plus a few volunteers).

But given that I can get more processing power (via Amazon) more easily than I can get more programmers, how can I make use of the resources I have to enhance my own productivity?

I’m studying Hadoop and Cascading in the hopes that they will help. But those systems are inherently batch-oriented. I’d like to move away from a batch processing model if I can. Given that AltLaw acquires 50 to 100 new cases per day, adding to a growing database of over half a million, what I would really like to have is a kind of “continuous integration” process for data. I want a server that runs continuously, accepting new data and new code and automatically re-running processes as needed to keep up with dependencies. Perhaps given a year of free time I could invent this myself, but I’m too busy debugging my shell scripts.
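To make the idea concrete, here is a minimal sketch in Clojure of the sort of loop I have in mind, purely illustrative and not code I actually run: each step declares its inputs and outputs, and a long-running process re-runs any step whose inputs are newer than its outputs, like a make that never exits. The step names, paths, and polling interval are all hypothetical.

(ns continuous-data
  (:import (java.io File)))

(defn newest
  "Most recent modification time among the given paths (0 if none exist)."
  [paths]
  (apply max 0 (map #(.lastModified (File. %)) paths)))

(defn stale?
  "A step is stale when an output is missing, or when any input is
  newer than its oldest output. (Checking a directory's timestamp is
  simplistic; a real version would look at the files inside it.)"
  [{:keys [inputs outputs]}]
  (or (some #(not (.exists (File. %))) outputs)
      (> (newest inputs)
         (apply min (map #(.lastModified (File. %)) outputs)))))

(def steps
  ;; Hypothetical pipeline: extract new cases, format them, index them.
  [{:name :extract :inputs ["incoming/"]  :outputs ["extracted/"] :run #(println "extracting...")}
   {:name :format  :inputs ["extracted/"] :outputs ["formatted/"] :run #(println "formatting...")}
   {:name :index   :inputs ["formatted/"] :outputs ["index/"]     :run #(println "indexing...")}])

(defn run-loop
  "Poll forever, re-running whichever steps have fallen behind."
  []
  (loop []
    (doseq [step steps
            :when (stale? step)]
      ((:run step)))
    (Thread/sleep 60000)
    (recur)))

A real version would also have to notice new code, not just new data, and re-run everything downstream of a changed processing step; that dependency tracking is exactly the part I don't want to hand-roll in shell scripts.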