Archive for July, 2008

Update September 22, 2008: I have abandoned this model.  I’m still using Hadoop, but with a much simpler data model.  I’ll post about it at some point.

Gosh darn, it’s hard to get this right.  In my most recent work on AltLaw, I’ve been building an infrastructure for doing all my back-end data processing using Hadoop and Thrift.

I’ve described this before, but here’s a summary of my situation:

  1. I have a few million files, ~15 GB total.
  2. Many files represent the same logical entity, sometimes in different formats or from different sources.
  3. Every file needs 5-10 steps of clean-up, data extraction, and format conversion.
  4. The files come from ~15 different sources, each requiring different processing.
  5. I get 20-50 new files daily.

I want to be able to process all this efficiently, but I also want to be able to change my mind.  I can never say, “I’ve run this process on this batch of files, so I never need to do it again.”  I might improve the code, or I might find that I need to go back to the original files to get some other kind of data.

Hadoop was the obvious choice for efficiency.  Thrift is a good, compact data format that’s easier to use than Hadoop’s native data structures.  The only question is, what’s the schema?  Or, more simply, what do I want to store?

I’ve come up with what, for want of a better term, I call the Document-Blob Model.

I start with a collection of Documents.  Each Document represents a single, logical entity, like a court case or a section of statute.  A Document contains an integer identifier and an array of Blobs, nothing more.

What is a Blob?  Good question.  It’s data, any data.  It may represent a normal file, in which case it stores the content of that file and some metadata like the MIME type.  It may also represent a data structure.  Because unused fields do not occupy any space in Thrift’s binary format, the Blob type can have fields for every structure I might want to use now or in the future.  In effect, a Blob is a polymorphic type that can become any other type.

So how do I know which type it is?  By where it came from.  Each Blob is tagged with the name of its “provider”.  For files downloaded in bulk, the provider is the web site or service where I got them.  For generated files, the creator is a class or script, with a version number.

So I have a few hundred thousand Documents, each containing several Blobs.  I represent each conversion/extraction/processing step as its own Java class.  All those classes implement the same, simple interface: take one Blob as input and return another Blob as output.

Helper functions allow me to say things like, “Take this Document, find the Blob that was generated by class X.  Run  class Y on that Blob, and append result to the original Document.”  In this way, I can stack multiple processing steps into a single Hadoop job, but retain the option of reusing or rearranging those steps later on.

Will this work?  I have no idea.  I just came up with it last week.  Today I successfully ran a 5-step job on ~700,000 documents from the public.resource.org federal case corpus.  It took about an hour on a 10-node Hadoop/EC2 cluster.

The real test will come when I apply this model to the much messier collection of files we downloaded directly from the federal courts.

Comments 2 Comments »

Google recently released its Protocol Buffers as open source. About a year ago, Facebook released a similar product called Thrift. I’ve been comparing them; here’s what I’ve found:


Thrift Protocol Buffers
Backers Facebook, Apache (accepted for incubation) Google
Bindings C++, Java, Python, PHP, XSD, Ruby, C#, Perl, Objective C, Erlang, Smalltalk, OCaml, and Haskell C++, Java, Python
(Perl, Ruby, and C# under discussion)
Output Formats Binary, JSON Binary
Primitive Types bool
byte
16/32/64-bit integers

double
string
byte sequence
map<t1,t2>
list<t>
set<t>
bool

32/64-bit integers
float
double
string
byte sequence

“repeated” properties act like lists
Enumerations Yes Yes
Constants Yes No
Composite Type struct message
Exception Type Yes No
Documentation So-so Good
License BSD-style Apache
Compiler Language C++ C++
RPC Interfaces Yes Yes
RPC Implementation Yes No
Composite Type Extensions No Yes

Overall, I think Thrift wins on features and Protocol Buffers win on
documentation. Implementation-wise, they’re quite similar. Both use
integer tags to identify fields, so you can add and remove fields
without breaking existing code. Protocol Buffers support
variable-width encoding of integers, which saves a few bytes. (Thrift
has an experimental output format with variable-width ints.)

The major difference is that Thrift provides a full client/server RPC
implementation, whereas Protocol Buffers only generate stubs to use in
your own RPC system.

Update July 12, 2008: I haven’t tested for speed, but from a cursory examination it seems that, at the binary level, Thrift and Protocol Buffers are very similar. I think Thrift will develop a more coherent community now that it’s under Apache incubation. It just moved to a new web site and mailing list, and the issue tracker is active.

Comments 17 Comments »

I’m sure I’m not the first to suggest this, but here goes.

Ever since somebody first thought of applying the Model-View-Controller paradigm to the web, we’ve had this:

MVC With Controller on the Server

The View is a conflation of HTML and JavaScript.  JavaScript is an afterthought, a gimmick to make pages “dynamic.”  All the real action is in the Controller, which talks to the database, processes the internal application logic, and renders templates before sending complete pages back to the client.

But what if we implement the Controller entirely in JavaScript?

MVC With Controller on the Client
Now we can put the Controller on the client, and build a RESTful HTTP interface to communicate with the database.

Obviously there are many issues to consider.  First and foremost is making sure that rogue clients cannot do anything to the database they’re not supposed to.  But that’s a manageable problem — Amazon S3 is a good example.  Apps that run entirely in the client can even be made more secure than their server-based counterparts, because encryption can be implemented entirely in the client, so that the server never sees the unencrypted data.  (Clipperz, an password-storage service, calls this a zero-knowledge web app.)

There are some interesting possibilities.  For example, the entire application, including the current state of the model, can be downloaded as a single web page for off-line use.  (Clipperz supports this.)  Also, the same application could connect to multiple data sources.  And as with any RESTful architecture, back-end scaling is relatively easy.

Update July 10, 2008: I’m always amazed when one of my posts show up on Reddit.  Maybe it was the diagrams.  In any case, thanks to everyone who sent in comments.  A couple responses:

  1. Yes, in a sense I’ve described Ajax.  But most Ajax-related code around the web these days is still in the “dynamic view” mode rather than the “client-side controller” mode.
  2. I like Sun’s MVC diagram in which the View takes an active role in rendering the model rather than being just a template.  It’s actually quite similar to what I’m suggesting here.
  3. Some MVC frameworks, such as Ruby on Rails, insist that logic in the View is bad, but then they include all these Ajax view helpers, so it’s a bit of a mixed message.
  4. I’m not insisting that all business logic be implemented client-side.  Rather, I’m assuming some kind of “smart” database, with a RESTful front-end, that’s capable of containing business logic.  Back in the day, these were SQL stored procedures.  Now it’s probably something like CouchDB.
  5. Yes, this design is bad for search engines, bookmarks, and deep linking.  But there are plenty of cases where those don’t matter.  Look at Google Mail, for example.  It basically follows the design I’ve laid out here: the entire app is one HTML page (or a very few pages) with behavior implemented in client-side JavaScript.

Update August 7, 2008: This is an example of the code-on-demand style described in Roy Fielding’s REST thesis.  Link from Joe Gregorio.

Comments 15 Comments »