I haven’t posted in a while — look for more later this summer.
But in the mean time, I have a question: How do you structure data such that you can efficiently manipulate it on both a large scale and a small scale at the same time?
By large scale, I mean applying a transformation or analysis efficiently over every record in a multi-gigabyte collection. Hadoop is very good at this, but it achieves its efficiency by working with collections in large chunks, typically 64 MB and up.
What you can’t do with Hadoop — at least, not efficiently — is retrieve a single record. Systems layered on top of Hadoop, like HBase, attempt to mitigate the problem, but they are still slower than, say, a relational database.
In fact, considering the history of storage and data access technologies, most of them have been geared toward efficient random access to individual records — RAM, filesystems, hard disks, RDBMS’s, etc. But as Hadoop demonstrates, random-access systems tend to be inefficient for handling very large data collections in the aggregate.
This is not merely theoreticaly musing — it’s a problem I’m trying to solve with AltLaw. I can use Hadoop to process millions of small records. The results come out in large Hadoop SequenceFiles. But then I want to provide random-access to those records via the web site. So I have to somehow “expand” the contents of those SequenceFiles into individual records and store those records in some format that provides efficient random access.
Right now, I use two very blunt instruments — Lucene indexes and plain old files. In the final stage of my processing chain, metadata and searchable text get written to a Lucene index, and the pre-rendered HTML content of each page gets written to a file on an XFS filesystem. This works, but it ends up being one of the slower parts of the process. Building multiple Lucene indexes and merging them into one big (~6 GB) index takes an hour; writing all the files to the XFS filesystem takes about 20 minutes. There is no interesting data manipulation going on here, I’m just moving data around.