Scalability for dummies like me

Alex Barrera wrote a nice little article about why “scalability issues” can prevent any visible progress on a web project for months at a time: Scalability Issues for Dummies.

I’ve been in this position — no visible progress while redesigning a back-end — with AltLaw several times now. I’m contemplating yet another redesign right now, and I don’t know if I can stand it.

Big & Small at the same time

I haven’t posted in a while — look for more later this summer.

But in the meantime, I have a question: How do you structure data such that you can efficiently manipulate it on both a large scale and a small scale at the same time?

By large scale, I mean applying a transformation or analysis efficiently over every record in a multi-gigabyte collection.  Hadoop is very good at this, but it achieves its efficiency by working with collections in large chunks, typically 64 MB and up.

What you can’t do with Hadoop — at least, not efficiently — is retrieve a single record.  Systems layered on top of Hadoop, like HBase, attempt to mitigate the problem, but they are still slower than, say, a relational database.

In fact, most storage and data-access technologies — RAM, filesystems, hard disks, RDBMSs — have historically been geared toward efficient random access to individual records.  But as Hadoop demonstrates, random-access systems tend to be inefficient for handling very large data collections in the aggregate.

This is not merely theoretical musing — it’s a problem I’m trying to solve with AltLaw.  I can use Hadoop to process millions of small records.  The results come out in large Hadoop SequenceFiles.  But then I want to provide random access to those records via the web site.  So I have to somehow “expand” the contents of those SequenceFiles into individual records and store those records in some format that provides efficient random access.
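
Concretely, the “expansion” step looks something like the sketch below, which reads a SequenceFile with Hadoop’s SequenceFile.Reader and writes each record to its own file.  (The Text key / BytesWritable value types and the output naming are illustrative assumptions, not the actual AltLaw schema.)

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ExpandSequenceFile {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // args[0] = input SequenceFile, args[1] = output directory
            SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path(args[0]), conf);
            Text key = new Text();                     // record identifier
            BytesWritable value = new BytesWritable(); // record content
            try {
                while (reader.next(key, value)) {
                    // One output file per record.
                    FSDataOutputStream out =
                        fs.create(new Path(args[1], key.toString() + ".html"));
                    out.write(value.getBytes(), 0, value.getLength());
                    out.close();
                }
            } finally {
                reader.close();
            }
        }
    }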

Right now, I use two very blunt instruments — Lucene indexes and plain old files.  In the final stage of my processing chain, metadata and searchable text get written to a Lucene index, and the pre-rendered HTML content of each page gets written to a file on an XFS filesystem.  This works, but it ends up being one of the slower parts of the process.  Building multiple Lucene indexes and merging them into one big (~6 GB) index takes an hour; writing all the files to the XFS filesystem takes about 20 minutes.  There is no interesting data manipulation going on here; I’m just moving data around.
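
For the curious, the indexing step is roughly this shape (Lucene 2.x-era API; the field names are illustrative, not the real AltLaw schema):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class CaseIndexer {
        private final IndexWriter writer;

        public CaseIndexer(String indexPath) throws Exception {
            Directory dir = FSDirectory.getDirectory(indexPath);
            // true = create a new index, overwriting any existing one
            writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        }

        // One Lucene Document per case: stored metadata plus analyzed full text.
        public void add(String docId, String title, String text) throws Exception {
            Document doc = new Document();
            doc.add(new Field("id", docId, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("text", text, Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }

        public void close() throws Exception {
            writer.optimize();  // merge segments for faster searching
            writer.close();
        }
    }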

Dirty Necessary Money

Up at Cornell, Tom Bruce has a post about the problem of funding open access to legal materials. This brings to mind a conversation I had with a doctor friend recently about AltLaw. My friend, accustomed to the open-access requirements of NIH grants, was frankly shocked that there are no comparable rules for legal decisions.

[Screenshot: the PubMed home page, illustrating the NIH Public Access Policy]

A related problem is how to make people aware of what free services are available. AltLaw has been around for two years, and while traffic has grown steadily, it has not gotten as much attention as commercial startups operating similar services. Admittedly, we have done no advertising at all, and that’s our fault. “If you build it, they will come,” we thought, naïvely. But how would we advertise? I’m a programmer; the people I work with are law professors. None of us know the first thing about marketing, and quite frankly, none of us care. Seen in that light, Cornell’s recent partnership with Justia.com is a smart move that will benefit everyone working on open-access law, since it will expose more lawyers to the idea.

Antidenormalizationism

When storing any large collection of data, one of the most critical decisions one has to make is when to normalize and when to denormalize.  Normalized data is good for flexibility — you can write queries to recombine things in any combination.  Denormalized data is more efficient when you know, in advance, what the queries will be.  There’s a maxim, “normalize until it hurts, denormalize until it works.”
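
A toy example, with made-up record types, shows the trade-off:

    // Normalized: a case stores only a reference to its court.  Flexible
    // (rename a court in one place), but reading a case's court name
    // requires a join or a second lookup.
    class Court { int id; String name; }
    class Case  { int id; String title; int courtId; }

    // Denormalized: the court name is copied into every case record.
    // Each record is self-contained and can be processed in isolation,
    // at the cost of duplicated data and harder updates.
    class DenormalizedCase { int id; String title; String courtName; }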

In the past two years, I’ve become a big fan of three software paradigms:

  1. RDF and the Semantic Web
  2. RESTful architectures
  3. Hadoop and MapReduce

The first two were made for each other — RDF is the ideal data format for the Semantic Web, and Semantic Web applications are (ideally) implemented with RESTful architectures.

Unfortunately, as I discovered through painful experience, RDF is an inefficient representation for large amounts of data.  It’s the ultimate in normalization — each triple represents a single fact — and in flexibility, but even simple queries become prohibitively slow when each item of interest is spread across dozens of triples.
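
For example, a single court case might be scattered across triples like these (hypothetical URIs, N-Triples syntax):

    <http://altlaw.org/cases/12345> <http://purl.org/dc/terms/title>   "Smith v. Jones" .
    <http://altlaw.org/cases/12345> <http://purl.org/dc/terms/date>    "1998-06-01" .
    <http://altlaw.org/cases/12345> <http://purl.org/dc/terms/creator> <http://altlaw.org/courts/ca9> .

Reassembling even one record means collecting dozens of facts like these, and a query that touches thousands of records multiplies that cost.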

At the other end of the spectrum, MapReduce programming forces you into an entirely different way of thinking.  MapReduce offers exactly one data access mechanism: read a list of records from beginning to end.  No random access, do not pass GO, do not collect $200.  That restriction is what enables Hadoop to distribute a job across many machines, and to process data at the maximum rate supported by the hard disk.  Obviously, to take full advantage of these optimizations, your MapReduce program needs to be able to process each record in isolation, without referring to any other resources.  In effect, everything has to be completely denormalized.
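
In code, that constraint looks something like this mapper sketch against the Hadoop 0.18-era API (the record types and the transformation are placeholders):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // The framework hands the mapper one record at a time, in whatever order
    // and on whatever machine it chooses.  There is no way to ask for "the
    // record with id X"; the mapper only sees the record in front of it.
    public class TransformMapper extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {

        public void map(Text key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Everything needed to process this record must already be in it.
            Text transformed = new Text(value.toString().toLowerCase());
            output.collect(key, transformed);
        }
    }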

For my work on AltLaw, I’ve tended to use fully denormalized data, because anything else is too slow.  But I want to be working with normalized data.  I want to have every datum — plus information about its derivation — stored in one giant RDF graph.  But I also want to be able to process this graph efficiently.  Maybe if I had a dozen machines with 64 GB of memory apiece, this wouldn’t be a problem.  But with one desktop, one server, and occasional rides on EC2, that’s not an option.

The ideal design, I think, would be a hybrid system which can do efficient bulk processing of RDF data.  There are some pilot projects to do this with Hadoop, and I’m interested to see how they pan out.  But until then, I’ll have to make do as best I can.

Update: The projects I remembered were HRDF and Heart, which turn out to be the same thing: http://code.google.com/p/hrdf/ and http://wiki.apache.org/incubator/HeartProposal

Fragmentation and the Failure of the Web

What makes an on-line community?  In the past two weeks I have received announcements of three new “communities,” all interested in using open-source software to retrieve, share, and analyze data from or about governments.  Most of these announcements say the same thing: “A lot of people seem to be working on this, but they aren’t talking to each other.”  Each group has a slightly different slant, but in my mind I lump them all under the heading “Semantic Government,” i.e. building the semantic web for government data.

I started casting out a few search queries, and quickly compiled a list of eight different mailing lists and/or wikis devoted to this subject.  That doesn’t include for-profits like Justia.com or larger non-profits like the Sunlight Foundation.

This is a problem.  Not only do I have to subscribe to half a dozen mailing lists to keep abreast of what others are doing, but I also have to cross-post to several lists when I want to announce something myself.  So far, nothing I have posted to these lists has garnered as much response as private emails sent directly to people who I know are subscribed to the lists.

Perhaps the very idea of a “web-based community” has become a victim of its own success.  Back in the olden days, when I was still learning how to type, creating an on-line community was hard.  You had to wrangle with BBS software, mailing list managers, or content management systems.  It took dedicated individuals willing to invest considerable time and money.  Now?  Just go to Google / Yahoo / Facebook / whatever flavor-of-the-month service, type in the name of your group and presto, you’re a “community.”

The problem is that it’s now easier to start a group than to join one.  Every project wants to be the center of its own community, but what most projects actually get is a lonely soapbox in the wilderness from which to cry, “Announcing version 0.x…”

I’m equally guilty in this trend, having founded one of the sites I referred to above (LawCommons) and built a wiki for another (IGOTF).  Once you’ve started a site it’s easier to leave it there than to formally announce “I am shutting down X and throwing my lot in with Y.”  It’s also a hedge against the (very likely) possibility that group Y won’t be around in a year.  But I worry that a broad movement (Semantic Government) fragmented into so many tiny sub-groups will never gather enough momentum to succeed.  The very thing we all want — to share information better — is lost through the scattered efforts to achieve it.

The Document-Blob Model

Update September 22, 2008: I have abandoned this model.  I’m still using Hadoop, but with a much simpler data model.  I’ll post about it at some point.

Gosh darn, it’s hard to get this right.  In my most recent work on AltLaw, I’ve been building an infrastructure for doing all my back-end data processing using Hadoop and Thrift.

I’ve described this before, but here’s a summary of my situation:

  1. I have a few million files, ~15 GB total.
  2. Many files represent the same logical entity, sometimes in different formats or from different sources.
  3. Every file needs 5-10 steps of clean-up, data extraction, and format conversion.
  4. The files come from ~15 different sources, each requiring different processing.
  5. I get 20-50 new files daily.

I want to be able to process all this efficiently, but I also want to be able to change my mind.  I can never say, “I’ve run this process on this batch of files, so I never need to do it again.”  I might improve the code, or I might find that I need to go back to the original files to get some other kind of data.

Hadoop was the obvious choice for efficiency.  Thrift is a good, compact data format that’s easier to use than Hadoop’s native data structures.  The only question is, what’s the schema?  Or, more simply, what do I want to store?

I’ve come up with what, for want of a better term, I call the Document-Blob Model.

I start with a collection of Documents.  Each Document represents a single, logical entity, like a court case or a section of statute.  A Document contains an integer identifier and an array of Blobs, nothing more.

What is a Blob?  Good question.  It’s data, any data.  It may represent a normal file, in which case it stores the content of that file and some metadata like the MIME type.  It may also represent a data structure.  Because unused fields do not occupy any space in Thrift’s binary format, the Blob type can have fields for every structure I might want to use now or in the future.  In effect, a Blob is a polymorphic type that can become any other type.

So how do I know which type it is?  By where it came from.  Each Blob is tagged with the name of its “provider”.  For files downloaded in bulk, the provider is the web site or service where I got them.  For generated files, the provider is the class or script that created them, with a version number.
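
In Thrift IDL, the whole model is something like this (the field names and numbers are made up for this post, and the real Blob struct has many more optional fields):

    struct Blob {
      1: optional string provider,  // web site, service, or class+version that produced this
      2: optional string mimeType,  // set when the Blob wraps an ordinary file
      3: optional binary content,   // raw file bytes
      // ...plus one optional field for each structured type I might want to
      // store; unused fields take no space in the binary encoding
    }

    struct Document {
      1: required i32 id,           // integer identifier for the logical entity
      2: required list<Blob> blobs
    }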

So I have a few hundred thousand Documents, each containing several Blobs.  I represent each conversion/extraction/processing step as its own Java class.  All those classes implement the same, simple interface: take one Blob as input and return another Blob as output.

Helper functions allow me to say things like, “Take this Document, find the Blob that was generated by class X.  Run class Y on that Blob, and append the result to the original Document.”  In this way, I can stack multiple processing steps into a single Hadoop job, but retain the option of reusing or rearranging those steps later on.
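
Roughly, the interface and a helper look like this (the names are hypothetical, and I’m assuming Thrift-generated accessors like getBlobs and addToBlobs):

    // Every clean-up / extraction / conversion step implements this.
    public interface Step {
        Blob process(Blob input) throws Exception;
    }

    // Find the Blob produced by a given provider, run a step on it, and
    // append the result to the Document, tagged with the step's class name.
    public class Documents {
        public static void applyStep(Document doc, String provider, Step step)
                throws Exception {
            for (Blob blob : doc.getBlobs()) {
                if (provider.equals(blob.getProvider())) {
                    Blob result = step.process(blob);
                    result.setProvider(step.getClass().getName());
                    doc.addToBlobs(result);
                    return;
                }
            }
        }
    }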

Will this work?  I have no idea.  I just came up with it last week.  Today I successfully ran a 5-step job on ~700,000 documents from the public.resource.org federal case corpus.  It took about an hour on a 10-node Hadoop/EC2 cluster.

The real test will come when I apply this model to the much messier collection of files we downloaded directly from the federal courts.