Archive for February, 2008

I couldn’t sleep last night. So I designed a RESTful interface for version control. Yeah, that’s weird.

We already have a nice model for RESTful document storage in Amazon’s S3. But S3 doesn’t do versioning. What if it did?

A PUT to a previously-unused URI creates a new resource with version number 1.

Any subsequent PUT to that URI overwrites the resource and increments the version number.

When responding to a GET, the server sends three additional headers:

  1. X-Version-Number: the version number being sent;
  2. X-Version-Time: the timestamp for that version; and
  3. X-Latest-Version: the latest version number available.

If the client wants the latest version, it uses a normal GET.

If the client wants a specific version, it can include either X-Version-Number or X-Version-Time in the request header. For the former, the server returns the requested version. For the latter, the server returns the latest version recorded at or before the specified time.

Hey, presto, we’ve got S3 with versioning! But it’s not a real version control system like Subversion. Every new version overwrites the previous version unconditionally. There’s no notion of diffs or conflicts.

But wait, it’s time for everybody’s favorite HTTP verb, POST!

The client can send a POST with a body containing a diff from the previous version. If there’s a conflict, the server responds with “HTTP 409 Conflict” and shows the conflicting lines in the response body. Otherwise, it’s “HTTP 200 OK” and a new version is created.

We can make this look more like a traditional source-code version control system by adding some extra headers: X-Version-Comment and X-Version-Author for recording metadata about a revision. We could also add X-Version-Changeset with a value that identifies this revision as part of a larger set of changes.

Branching. The RESTful version control system can implement branching the same way that Subversion does: each branch goes in its own directory. Add a couple more headers: X-Branch-From says what resource this resource was branched from, and X-Branch-Version says at which version of the original resource the branch occurred. Include these headers in a PUT to create a new branch, then the server will always send them in response to GETs.

Security. Meh, if you want security, do it at the server or protocol level. Adding authentication to this interface would make it much more complicated.

Honestly, it feels like HTTP was designed for this. I’ve deviated slightly from the typical REST definition of POST, but I think my use of POST here is in keeping with the spirit of REST.

I haven’t implemented this, maybe I will some time.

Comments No Comments »

Well, it happened. I ran out of space on the 250 GB drive I use to develop AltLaw. Not all that surprising, although it did happen sooner than I expected. I’m deleting gigabytes of cached data — file conversions, mostly — just so I can get enough space to work again.

But this makes me wonder: am I operating at the edge of what is feasible (or practical) on a single machine? The latest batch of case updates, courtesy of public.resource.org, took nearly three days to run. And I had to divide the process up into smaller batches because a memory leak — traceable, I think, to Ruby’s HTTP library — was bogging down the system. When you’re dealing with hundreds of thousands of files at a time, even the standard UNIX tools like ls and grep start to feel inadequate. It’s damn hard to get a grip on a million files at once. Just clearing some space with a simple “rm -rf” can take hours.

Maybe it’s time for Amazon Web Services. But that’s a big, scary, enterprisey world. I’m just one little script monkey. I can barely keep track of data on one machine, how could I keep track of it on a dozen machines? Then again, AWS would give me more powerful tools for managing data — a bigger pair of tongs, so to speak. Decisions, decisions.

Comments No Comments »

I can’t help it. I just love big, complicated systems that let you get really precise about what you’re talking about. Types, classes, ontologies, schemas, normalization, denormalization, XML, RDF, XSLT, Java, … It’s all so cool. I can happily spend hours scribbling pages of hierarchies, interfaces, specifications, file formats, and the like.

But at the end of the day, I have a pile of hand-written notes and no code. All those fancy systems I love to study are not that useful when it comes to actually doing something useful. When I want to get something done, I hack up a Ruby script. It’s not elegant, but it works.

AltLaw makes a particularly tempting target for my engineering fantasies: all that unstructured data, just waiting to be tied to a big database table, bound with constraints, and queried relentlessly. But to get the site to actually work, I use a flat index that stores everything as (gasp) strings.

Perhaps I’m learning the virtue of the “worse is better” approach to engineering. The only problem is, “worse is better” is worse fun.

Comments No Comments »

Since I started working on AltLaw.org a little over a year ago, I’ve struggled with the data model. I’ve got around half a million documents, totally unstructured, from a dozen different sources. Each source has its own quirks and problems that have to be corrected before it’s useful.

When I started, I was using MySQL + the filesystem + Ferret. Keeping all that in sync, even with Rails’ ActiveRecord, was tricky, and searches were slow. Now I’ve got everything stored directly in a Solr index, which is very fast for searching. But it’s not so great when I want to change, say, how I convert the source PDFs into HTML, and I have to re-index half a million documents. And it can’t store structured data very well, so I augment it with serialized hash tables and SQLite.

What I want is a storage engine that combines properties of a relational database and a full-text search engine. It would put structured fields into SQL-like tables and unstructured fields into an inverted index for text searches. I want to be able to do queries that combine structured and unstructured data, like “cases decided in the Second Circuit between 1990 and 2000 that cite Sony v. Universal and contain the phrase ‘fair use’.” I also want to store metadata about my metadata, like where it came from, how accurate I think it is, etc.

I’ve been exploring some interesting alternatives: CouchDB is probably the closest to what I want, but it’s still under heavy development. DSpace is oriented more towards archival than search. Then there’s the big RDF monster, maybe the most flexible data model ever, which makes it correspondingly difficult to grasp.

I’m come to the conclusion that there is no perfect data model. Or, to put it another way, no storage engine is going to do my work for me. What I need to do is come up with a solid, flexible API that provides just the features I need, then put that API in front of one or more back-ends (Lucene, SQL, the filesystem) that handle whatever they’re best at.

Comments 3 Comments »