Update September 22, 2008: I have abandoned this model. I’m still using Hadoop, but with a much simpler data model. I’ll post about it at some point.
I’ve described this before, but here’s a summary of my situation:
- I have a few million files, ~15 GB total.
- Many files represent the same logical entity, sometimes in different formats or from different sources.
- Every file needs 5-10 steps of clean-up, data extraction, and format conversion.
- The files come from ~15 different sources, each requiring different processing.
- I get 20-50 new files daily.
I want to be able to process all this efficiently, but I also want to be able to change my mind. I can never say, “I’ve run this process on this batch of files, so I never need to do it again.” I might improve the code, or I might find that I need to go back to the original files to get some other kind of data.
Hadoop was the obvious choice for efficiency. Thrift is a good, compact data format that’s easier to use than Hadoop’s native data structures. The only question is, what’s the schema? Or, more simply, what do I want to store?
I’ve come up with what, for want of a better term, I call the Document-Blob Model.
I start with a collection of Documents. Each Document represents a single, logical entity, like a court case or a section of statute. A Document contains an integer identifier and an array of Blobs, nothing more.
What is a Blob? Good question. It’s data, any data. It may represent a normal file, in which case it stores the content of that file and some metadata like the MIME type. It may also represent a data structure. Because unused fields do not occupy any space in Thrift’s binary format, the Blob type can have fields for every structure I might want to use now or in the future. In effect, a Blob is a polymorphic type that can become any other type.
So how do I know which type it is? By where it came from. Each Blob is tagged with the name of its “provider”. For files downloaded in bulk, the provider is the web site or service where I got them. For generated files, the provider is the class or script that created them, plus a version number.
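Here is a minimal plain-Java sketch of the model as described. In practice these classes would be generated by the Thrift compiler from an IDL file; the field names beyond `provider` are illustrative guesses, not the actual schema:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the Document-Blob model. The real classes would be
// Thrift-generated; fields other than those named in the text are
// illustrative.
class Blob {
    String provider;   // where this Blob came from: a site name, or a class/script name with version
    String mimeType;   // set when the Blob wraps an ordinary file
    byte[] content;    // raw file bytes, if any
    // ...plus a field for every structured type that might be needed;
    // unused fields cost nothing in Thrift's binary encoding.
}

class Document {
    int id;                                // the single integer identifier
    List<Blob> blobs = new ArrayList<>();  // everything known about this logical entity
}
```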
So I have a few hundred thousand Documents, each containing several Blobs. I represent each conversion/extraction/processing step as its own Java class. All those classes implement the same simple interface: take one Blob as input and return another Blob as output.
Helper functions allow me to say things like, “Take this Document, find the Blob that was generated by class X. Run class Y on that Blob, and append the result to the original Document.” In this way, I can stack multiple processing steps into a single Hadoop job, but retain the option of reusing or rearranging those steps later on.
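A sketch of what that interface and helper might look like. All of the names here (`Step`, `findByProvider`, `applyStep`) are my own illustrations, not the actual classes; `Blob` and `Document` are repeated minimally so the sketch stands alone:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins for the Thrift-generated model classes.
class Blob {
    String provider;
    byte[] content;
}

class Document {
    int id;
    List<Blob> blobs = new ArrayList<>();
}

// The one-Blob-in, one-Blob-out interface every processing class implements.
interface Step {
    Blob process(Blob input);
}

class Pipeline {
    // "Find the Blob that was generated by X..."
    static Blob findByProvider(Document doc, String provider) {
        for (Blob b : doc.blobs) {
            if (provider.equals(b.provider)) return b;
        }
        return null;
    }

    // "...run Y on that Blob, and append the result to the original Document."
    static void applyStep(Document doc, String inputProvider,
                          Step step, String outputProvider) {
        Blob input = findByProvider(doc, inputProvider);
        if (input == null) return;          // nothing to process for this Document
        Blob output = step.process(input);
        output.provider = outputProvider;   // tag the result with its creator
        doc.blobs.add(output);              // original Blob is kept, so steps can be rerun
    }
}
```

Because each step only reads one tagged Blob and appends another, steps can be chained inside one Hadoop map task, and rearranging the pipeline is just a matter of changing which provider tags feed which steps.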
Will this work? I have no idea. I just came up with it last week. Today I successfully ran a 5-step job on ~700,000 documents from the public.resource.org federal case corpus. It took about an hour on a 10-node Hadoop/EC2 cluster.
The real test will come when I apply this model to the much messier collection of files we downloaded directly from the federal courts.