I recently read the expression “going non-linear” describing a person, where most people would say something like “going nuts.” Incredibly geeky; I like it.
… as explained by Sir Kenny,
From: Ken Tilton
Newsgroups: comp.lang.lisp
Date: Tue, 01 Apr 2008 14:53:07 -0400
Subject: Re: Newbie FAQ #2: Where’s the GUI?
Jonathan Gardner wrote:
> I know this is a FAQ, but I still don’t have any answers, at least answers that I like.
That’s because you missed FAQ #1 (“Where are the damn libraries?”) and the answer (“The Open Source Fairy has left the building. Do them your own damn self.”)
… message truncated …
There’s a school of thought that URIs should be opaque identifiers with no inherent meaning or structure. I think this is clearly a bad idea on the human-facing web, but it is more reasonable for computer-facing web services.
However, I’ve been generating a lot of RDF lately, trying to organize piles of metadata in AltLaw. I use URIs that have meaning to me, although they aren’t formally specified anywhere. I realized that a URI can represent a lot of information — not just the identity of a resource, but also its type, version, date, etc. — in a compact form. I can write code that manipulates URIs to get key information about a resource more quickly than I could by querying the RDF database.
Unfortunately, RDF query languages like SPARQL assume that URIs themselves are just pointers that do not contain any information. I could easily generate additional RDF statements containing all the same information encoded in the URI, but that would triple the size of my database and slow down my queries. (My experience so far with Sesame 2.0 is that complex queries are slow.)
What I need is a rule-based language for describing the structure of a URI and what it means. This would be similar to URI templates, and would map parts of the URI to RDF statements about that URI.
So if I make a statement about a resource at
http://altlaw.org/cases/US/2008-02-14/1204
(note: not a real URI), the RDF database automatically infers the following (in Turtle syntax):
<http://altlaw.org/cases/US/2008-02-14/1204>
    rdf:type <http://altlaw.org/rdf/Case> ;
    dc:identifier "1204" ;
    dc:jurisdiction "US" ;
    dc:date "2008-02-14" .
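Here’s a minimal sketch of the kind of rule I have in mind, in Ruby. The regular expression and the uri_to_triples helper are invented for illustration; they aren’t part of any existing RDF toolkit, and the rdf: and dc: prefixes are assumed to be declared elsewhere.

    # Hypothetical rule: one regex per URI pattern, mapped to RDF statements.
    CASE_PATTERN = %r{^http://altlaw\.org/cases/([^/]+)/(\d{4}-\d{2}-\d{2})/(\d+)$}

    def uri_to_triples(uri)
      m = CASE_PATTERN.match(uri) or return []
      jurisdiction, date, id = m[1], m[2], m[3]
      subject = "<#{uri}>"
      ["#{subject} rdf:type <http://altlaw.org/rdf/Case> .",
       "#{subject} dc:identifier \"#{id}\" .",
       "#{subject} dc:jurisdiction \"#{jurisdiction}\" .",
       "#{subject} dc:date \"#{date}\" ."]
    end

    puts uri_to_triples("http://altlaw.org/cases/US/2008-02-14/1204")

Run against the example URI, it prints the same four statements the database is supposed to infer.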
Let’s have the database models strut down the runway:
Relational (SQL):
Graph (RDF):
Tree (XML, filesystem):
Hash (BDB, memcached):
Inverted Index (Lucene/Solr):
Document-Oriented (CouchDB):
Column-Oriented (HBase, BigTable):
I couldn’t sleep last night. So I designed a RESTful interface for version control. Yeah, that’s weird.
We already have a nice model for RESTful document storage in Amazon’s S3. But S3 doesn’t do versioning. What if it did?
A PUT to a previously-unused URI creates a new resource with version number 1.
Any subsequent PUT to that URI overwrites the resource and increments the version number.
When responding to a GET, the server sends three additional headers:

X-Version-Number: the version number being sent;
X-Version-Time: the timestamp for that version; and
X-Latest-Version: the latest version number available.

If the client wants the latest version, it uses a normal GET. If the client wants a specific version, it can include either X-Version-Number or X-Version-Time in the request header. For the former, the server returns the requested version. For the latter, the server returns the latest version recorded at or before the specified time.
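Here’s a client-side sketch of those rules using Ruby’s Net::HTTP. The host and path are placeholders, since no server actually implements this.

    require 'net/http'
    require 'uri'

    uri = URI.parse('http://example.com/docs/readme.txt')  # placeholder resource

    Net::HTTP.start(uri.host, uri.port) do |http|
      # PUT to an unused URI creates version 1; later PUTs overwrite and increment.
      put = Net::HTTP::Put.new(uri.path)
      put.body = "Hello, world.\n"
      http.request(put)

      # A plain GET returns the latest version along with the version headers.
      latest = http.request(Net::HTTP::Get.new(uri.path))
      puts latest['X-Version-Number']  # version being sent
      puts latest['X-Version-Time']    # timestamp of that version
      puts latest['X-Latest-Version']  # newest version available

      # Asking for a specific earlier version by number.
      get = Net::HTTP::Get.new(uri.path)
      get['X-Version-Number'] = '1'
      puts http.request(get).body
    end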
Hey, presto, we’ve got S3 with versioning! But it’s not a real version control system like Subversion. Every new version overwrites the previous version unconditionally. There’s no notion of diffs or conflicts.
But wait, it’s time for everybody’s favorite HTTP verb, POST!
The client can send a POST with a body containing a diff from the previous version. If there’s a conflict, the server responds with “HTTP 409 Conflict” and shows the conflicting lines in the response body. Otherwise, it’s “HTTP 200 OK” and a new version is created.
We can make this look more like a traditional source-code version control system by adding some extra headers: X-Version-Comment and X-Version-Author for recording metadata about a revision. We could also add X-Version-Changeset with a value that identifies this revision as part of a larger set of changes.
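And a sketch of the POST side, again with Net::HTTP. The diff file and header values are made up, and I’m assuming the server also reports the new version number on a successful POST.

    require 'net/http'
    require 'uri'

    uri = URI.parse('http://example.com/docs/readme.txt')  # placeholder resource

    post = Net::HTTP::Post.new(uri.path)
    post.body = File.read('readme.diff')            # a diff against the previous version
    post['X-Version-Comment'] = 'Fix typo in intro'
    post['X-Version-Author']  = 'someone@example.com'

    response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(post) }

    case response.code
    when '200' then puts "New version: #{response['X-Version-Number']}"
    when '409' then puts "Conflict:\n#{response.body}"  # conflicting lines in the body
    end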
Branching. The RESTful version control system can implement branching the same way that Subversion does: each branch goes in its own directory. Add a couple more headers: X-Branch-From says what resource this resource was branched from, and X-Branch-Version says at which version of the original resource the branch occurred. Include these headers in a PUT to create a new branch, and the server will then always send them in response to GETs.
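Creating a branch would then just be a PUT with the two extra headers. The trunk/branches layout below is one possible convention, not something the protocol requires.

    require 'net/http'
    require 'uri'

    trunk  = 'http://example.com/trunk/readme.txt'               # original resource
    branch = URI.parse('http://example.com/branches/1.0/readme.txt')

    put = Net::HTTP::Put.new(branch.path)
    put.body = File.read('readme.txt')
    put['X-Branch-From']    = trunk  # where this resource was branched from
    put['X-Branch-Version'] = '7'    # version of the original at the branch point

    Net::HTTP.start(branch.host, branch.port) { |http| http.request(put) }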
Security. Meh, if you want security, do it at the server or protocol level. Adding authentication to this interface would make it much more complicated.
Honestly, it feels like HTTP was designed for this. I’ve deviated slightly from the typical REST definition of POST, but I think my use of POST here is in keeping with the spirit of REST.
I haven’t implemented this; maybe I will some time.
Well, it happened. I ran out of space on the 250 GB drive I use to develop AltLaw. Not all that surprising, although it did happen sooner than I expected. I’m deleting gigabytes of cached data — file conversions, mostly — just so I can get enough space to work again.
But this makes me wonder: am I operating at the edge of what is feasible (or practical) on a single machine? The latest batch of case updates, courtesy of public.resource.org, took nearly three days to run. And I had to divide the process up into smaller batches because a memory leak — traceable, I think, to Ruby’s HTTP library — was bogging down the system. When you’re dealing with hundreds of thousands of files at a time, even the standard UNIX tools like ls and grep start to feel inadequate. It’s damn hard to get a grip on a million files at once. Just clearing some space with a simple “rm -rf” can take hours.
Maybe it’s time for Amazon Web Services. But that’s a big, scary, enterprisey world. I’m just one little script monkey. I can barely keep track of data on one machine, how could I keep track of it on a dozen machines? Then again, AWS would give me more powerful tools for managing data — a bigger pair of tongs, so to speak. Decisions, decisions.
I can’t help it. I just love big, complicated systems that let you get really precise about what you’re talking about. Types, classes, ontologies, schemas, normalization, denormalization, XML, RDF, XSLT, Java, … It’s all so cool. I can happily spend hours scribbling pages of hierarchies, interfaces, specifications, file formats, and the like.
But at the end of the day, I have a pile of hand-written notes and no code. All those fancy systems I love to study turn out not to be much use when I actually need to build something. When I want to get something done, I hack up a Ruby script. It’s not elegant, but it works.
AltLaw makes a particularly tempting target for my engineering fantasies: all that unstructured data with so much potential structure. But to get the site to actually work, I use a flat index that stores everything as (gasp) strings.
Perhaps I’m learning the virtue of the “worse is better” approach to engineering. The only problem is, “worse is better” is worse fun.
Since I started working on AltLaw.org a little over a year ago, I’ve struggled with the data model. I’ve got around half a million documents, totally unstructured, from a dozen different sources. Each source has its own quirks and problems that have to be corrected before it’s useful.
When I started, I was using MySQL + the filesystem + Ferret. Keeping all that in sync, even with Rails’ ActiveRecord, was tricky, and searches were slow. Now I’ve got everything stored directly in a Solr index, which is very fast for searching. But it’s not so great when I want to change, say, how I convert the source PDFs into HTML, and I have to re-index half a million documents. And it can’t store structured data very well, so I augment it with serialized hash tables and SQLite.
What I want is a storage engine that combines properties of a relational database and a full-text search engine. It would put structured fields into SQL-like tables and unstructured fields into an inverted index for text searches. I want to be able to do queries that combine structured and unstructured data, like “cases decided in the Second Circuit between 1990 and 2000 that cite Sony v. Universal and contain the phrase ‘fair use’.” I also want to store metadata about my metadata, like where it came from, how accurate I think it is, etc.
I’ve been exploring some interesting alternatives: CouchDB is probably the closest to what I want, but it’s still under heavy development. DSpace is oriented more towards archival than search. Then there’s the big RDF monster, maybe the most flexible data model ever, which makes it correspondingly difficult to grasp.
I’ve come to the conclusion that there is no perfect data model. Or, to put it another way, no storage engine is going to do my work for me. What I need to do is come up with a solid, flexible API that provides just the features I need, then put that API in front of one or more back-ends (Lucene, SQL, the filesystem) that handle whatever they’re best at.
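Roughly, the API I’m imagining might look like this. Every class, method, and field name here is invented to sketch the shape of the thing; none of it is working code.

    # Hypothetical facade: structured fields go to a SQL table, full text goes to
    # an inverted index (Lucene/Solr), and the original files stay on the filesystem.
    class DocumentStore
      def initialize(sql, fulltext, files)
        @sql, @fulltext, @files = sql, fulltext, files
      end

      def add(doc)
        @files.write(doc.id, doc.raw)        # original bytes
        @sql.insert(:documents, doc.fields)  # structured metadata
        @fulltext.index(doc.id, doc.text)    # text for phrase searches
      end

      # Combine a structured filter with a full-text phrase query.
      def search(fields = {}, phrase = nil)
        ids = @sql.select_ids(:documents, fields)
        phrase ? @fulltext.search(phrase, :within => ids) : ids
      end
    end

    # The query from above, roughly:
    #   store.search({:court => '2d Cir.', :decided => 1990..2000,
    #                 :cites => 'Sony v. Universal'}, '"fair use"')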