Archive for the “Programming” Category


Epigrams on Programming

Comments No Comments »

… as explained by Sir Kenny,

From: Ken Tilton
Newsgroups: comp.lang.lisp
Date: Tue, 01 Apr 2008 14:53:07 -0400
Subject: Re: Newbie FAQ #2: Where’s the GUI?

Jonathan Gardner wrote:
> I know this is a FAQ, but I still don’t have any answers, at least answers that I like.

That’s because you missed FAQ #1 (”Where are the damn libraries?”) and the answer (”The Open Source Fairy has left the building. Do them your own damn self.”)

… message truncated …

Comments No Comments »

Let’s have the database models strut down the runway:

Relational (SQL):

  • Data consist of rows in tables.
  • Each row has multiple columns.
  • Each column has a fixed type.
  • Queries use filters and joins.
  • Fixed schema is defined separately from the data.
  • User-defined indexes improve query performance.
  • Robust transaction/data-integrity support.

Graph (RDF):

  • Data consist of nodes and labeled links between them.
  • Nodes may have type information.
  • Optional schema is defined in the same model (e.g. RDFS, OWL).
  • Queries use graph-based constraints and filters.
  • Indexes are not optimized for specific datasets.
  • Limited transaction support.

Tree (XML, filesystem):

  • Data are identified by a name and a position within a hierarchy.
  • Queries use hierarchies and filters.
  • Data may have type information.
  • Optional schema may be defined in the same model (XML-Schema) or outside it (DTD).
  • Transaction support varies by implementation.

Hash (BDB, memcached):

  • Data are untyped binary blobs.
  • Data are identified by unique keys in a flat namespace.
  • Only key-based lookups are possible.
  • Limited transaction support.

Inverted Index (Lucene/Solr):

  • Data are text documents.
  • Queries are free-form text with some special operators.
  • Text queries are fast.
  • Query results can be scored/ranked.
  • Joins are inefficient; denormalized data are preferred.
  • Limited transaction support.

Document-Oriented (CouchDB):

  • Data are identified by unique keys in a flat namespace.
  • Data have named fields.
  • Optional schema defined by application code.
  • Denormalized data are preferred.
  • Limited type information.

Column-Oriented (Hbase, BigTable):

  • Data consist of rows in tables.
  • Each row has multiple columns, rows are assumed to be sparse.
  • Data from a single column are stored contiguously to improve performance.
  • Designed for distributed processing environments such as Hadoop and MapReduce.

Comments No Comments »

I couldn’t sleep last night. So I designed a RESTful interface for version control. Yeah, that’s weird.

We already have a nice model for RESTful document storage in Amazon’s S3. But S3 doesn’t do versioning. What if it did?

A PUT to a previously-unused URI creates a new resource with version number 1.

Any subsequent PUT to that URI overwrites the resource and increments the version number.

When responding to a GET, the server sends three additional headers:

  1. X-Version-Number: the version number being sent;
  2. X-Version-Time: the timestamp for that version; and
  3. X-Latest-Version: the latest version number available.

If the client wants the latest version, it uses a normal GET.

If the client wants a specific version, it can include either X-Version-Number or X-Version-Time in the request header. For the former, the server returns the requested version. For the latter, the server returns the latest version recorded at or before the specified time.

Hey, presto, we’ve got S3 with versioning! But it’s not a real version control system like Subversion. Every new version overwrites the previous version unconditionally. There’s no notion of diffs or conflicts.

But wait, it’s time for everybody’s favorite HTTP verb, POST!

The client can send a POST with a body containing a diff from the previous version. If there’s a conflict, the server responds with “HTTP 409 Conflict” and shows the conflicting lines in the response body. Otherwise, it’s “HTTP 200 OK” and a new version is created.

We can make this look more like a traditional source-code version control system by adding some extra headers: X-Version-Comment and X-Version-Author for recording metadata about a revision. We could also add X-Version-Changeset with a value that identifies this revision as part of a larger set of changes.

Branching. The RESTful version control system can implement branching the same way that Subversion does: each branch goes in its own directory. Add a couple more headers: X-Branch-From says what resource this resource was branched from, and X-Branch-Version says at which version of the original resource the branch occurred. Include these headers in a PUT to create a new branch, then the server will always send them in response to GETs.

Security. Meh, if you want security, do it at the server or protocol level. Adding authentication to this interface would make it much more complicated.

Honestly, it feels like HTTP was designed for this. I’ve deviated slightly from the typical REST definition of POST, but I think my use of POST here is in keeping with the spirit of REST.

I haven’t implemented this, maybe I will some time.

Comments No Comments »

Well, it happened. I ran out of space on the 250 GB drive I use to develop AltLaw. Not all that surprising, although it did happen sooner than I expected. I’m deleting gigabytes of cached data — file conversions, mostly — just so I can get enough space to work again.

But this makes me wonder: am I operating at the edge of what is feasible (or practical) on a single machine? The latest batch of case updates, courtesy of public.resource.org, took nearly three days to run. And I had to divide the process up into smaller batches because a memory leak — traceable, I think, to Ruby’s HTTP library — was bogging down the system. When you’re dealing with hundreds of thousands of files at a time, even the standard UNIX tools like ls and grep start to feel inadequate. It’s damn hard to get a grip on a million files at once. Just clearing some space with a simple “rm -rf” can take hours.

Maybe it’s time for Amazon Web Services. But that’s a big, scary, enterprisey world. I’m just one little script monkey. I can barely keep track of data on one machine, how could I keep track of it on a dozen machines? Then again, AWS would give me more powerful tools for managing data — a bigger pair of tongs, so to speak. Decisions, decisions.

Comments No Comments »

I can’t help it. I just love big, complicated systems that let you get really precise about what you’re talking about. Types, classes, ontologies, schemas, normalization, denormalization, XML, RDF, XSLT, Java, … It’s all so cool. I can happily spend hours scribbling pages of hierarchies, interfaces, specifications, file formats, and the like.

But at the end of the day, I have a pile of hand-written notes and no code. All those fancy systems I love to study are not that useful when it comes to actually doing something useful. When I want to get something done, I hack up a Ruby script. It’s not elegant, but it works.

AltLaw makes a particularly tempting target for my engineering fantasies: all that unstructured data, just waiting to be tied to a big database table, bound with constraints, and queried relentlessly. But to get the site to actually work, I use a flat index that stores everything as (gasp) strings.

Perhaps I’m learning the virtue of the “worse is better” approach to engineering. The only problem is, “worse is better” is worse fun.

Comments No Comments »

Since I started working on AltLaw.org a little over a year ago, I’ve struggled with the data model. I’ve got around half a million documents, totally unstructured, from a dozen different sources. Each source has its own quirks and problems that have to be corrected before it’s useful.

When I started, I was using MySQL + the filesystem + Ferret. Keeping all that in sync, even with Rails’ ActiveRecord, was tricky, and searches were slow. Now I’ve got everything stored directly in a Solr index, which is very fast for searching. But it’s not so great when I want to change, say, how I convert the source PDFs into HTML, and I have to re-index half a million documents. And it can’t store structured data very well, so I augment it with serialized hash tables and SQLite.

What I want is a storage engine that combines properties of a relational database and a full-text search engine. It would put structured fields into SQL-like tables and unstructured fields into an inverted index for text searches. I want to be able to do queries that combine structured and unstructured data, like “cases decided in the Second Circuit between 1990 and 2000 that cite Sony v. Universal and contain the phrase ‘fair use’.” I also want to store metadata about my metadata, like where it came from, how accurate I think it is, etc.

I’ve been exploring some interesting alternatives: CouchDB is probably the closest to what I want, but it’s still under heavy development. DSpace is oriented more towards archival than search. Then there’s the big RDF monster, maybe the most flexible data model ever, which makes it correspondingly difficult to grasp.

I’m come to the conclusion that there is no perfect data model. Or, to put it another way, no storage engine is going to do my work for me. What I need to do is come up with a solid, flexible API that provides just the features I need, then put that API in front of one or more back-ends (Lucene, SQL, the filesystem) that handle whatever they’re best at.

Comments 3 Comments »

The most famous piece of Lisp-related vaporware is vapor no longer: Arc has been released. After paging through the tutorial, I’m a bit underwhelmed. It looks like just a bunch of syntactic sugar implemented on top of Scheme. Clojure is more interesting and more innovative. Clojure and Arc have some things in common: data structures are functions on their keys, strings are sequences, and lambda is “fn”. But Arc seems to be more in the traditional, semi-imperative model whereas Clojure borrows heavily from functional programming. Arc even has for-loops, for goodness sakes. Obviously, I have no idea what else Paul Graham may have up his sleeve, so maybe there’s much more exciting stuff to come. But for now Arc looks to me like a modest improvement on Common Lisp / Scheme rather than a major new language.

Comments No Comments »

I finally found out how to do this, from the Rails Routing shortcut by David Black. In the Rails console, do this:

include ActionController::UrlWriter
default_url_options[:host] = 'whatever'

Then you can call your named route methods directly from the console.

Comments 1 Comment »

I am happy to report that AltLaw.org’s switch to Solr has worked very well. Solr is a RESTful search engine, built on Lucene. The setup was more complicated than just using a search library, but the rewards were worth it.

Before, I was using Ferret, which I still like. It’s a great library, and Dave Balmain has done incredible work producing a fast search engine that integrates well with Ruby. I still use it on other sites. But Solr was a better fit for AltLaw.

With Ferret, I was trying to shoe-horn large, unstructured documents into a system — ActiveRecord, MySQL, and acts_as_ferret — that is better suited to small, structured records. Now I use Solr as both a search engine and a document store, eliminating MySQL from the picture. That, combined with Solr’s built-in caching, has dramatically decreased the server load (from around 2.00 to under 0.30) while visibly improving search performance.

Also, I think it helps that Solr is not integrated with Ruby. The solr-ruby gem is not well documented, but easy to figure out, as it’s just a thin wrapper over the Solr APIs. Having the search engine in a separate process made it easier to separate the indexing & searching part of the code from the web application. As a result, the Rails code shrunk to one-fourth its former size.

Comments No Comments »