I haven’t posted in a while — look for more later this summer.
But in the meantime, I have a question: How do you structure data such that you can efficiently manipulate it on both a large scale and a small scale at the same time?
By large scale, I mean applying a transformation or analysis efficiently over every record in a multi-gigabyte collection. Hadoop is very good at this, but it achieves its efficiency by working with collections in large chunks, typically 64 MB and up.
What you can’t do with Hadoop — at least, not efficiently — is retrieve a single record. Systems layered on top of Hadoop, like HBase, attempt to mitigate the problem, but they are still slower than, say, a relational database.
In fact, considering the history of storage and data access technologies, most of them have been geared toward efficient random access to individual records — RAM, filesystems, hard disks, RDBMS’s, etc. But as Hadoop demonstrates, random-access systems tend to be inefficient for handling very large data collections in the aggregate.
This is not merely theoretical musing — it’s a problem I’m trying to solve with AltLaw. I can use Hadoop to process millions of small records. The results come out in large Hadoop SequenceFiles. But then I want to provide random access to those records via the web site. So I have to somehow “expand” the contents of those SequenceFiles into individual records and store those records in some format that provides efficient random access.
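To make the limitation concrete, here is a minimal sketch of what reading a SequenceFile looks like. It assumes Text keys (document IDs) and Text values; the real key and value classes depend on how the Hadoop job that produced the file was configured. The point is that records can only be read in order — there is no way to jump straight to the record for a given key.

<pre><code>import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Text key = new Text();
        Text value = new Text();
        try {
            // Records come back strictly in sequence; to find one record
            // you have to scan past everything before it.
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value.getLength() + " bytes");
            }
        } finally {
            reader.close();
        }
    }
}
</code></pre>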
Right now, I use two very blunt instruments — Lucene indexes and plain old files. In the final stage of my processing chain, metadata and searchable text get written to a Lucene index, and the pre-rendered HTML content of each page gets written to a file on an XFS filesystem. This works, but it ends up being one of the slower parts of the process. Building multiple Lucene indexes and merging them into one big (~6 GB) index takes an hour; writing all the files to the XFS filesystem takes about 20 minutes. There is no interesting data manipulation going on here; I’m just moving data around.
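In case the shape of that final stage isn’t clear, here is a rough sketch of what handling one record looks like, written against the Lucene 2.x-era Field API. The field names (“docid”, “title”, “text”) and the one-file-per-document layout are illustrative assumptions, not AltLaw’s actual schema.

<pre><code>import java.io.File;
import java.io.FileWriter;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexAndStore {
    public static void addRecord(IndexWriter writer, File htmlDir,
                                 String docid, String title,
                                 String text, String html) throws Exception {
        // Metadata and searchable text go into the Lucene index...
        Document doc = new Document();
        doc.add(new Field("docid", docid, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", text, Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);

        // ...while the pre-rendered HTML goes to a plain file on disk,
        // one file per record, named by document ID.
        FileWriter out = new FileWriter(new File(htmlDir, docid + ".html"));
        try {
            out.write(html);
        } finally {
            out.close();
        }
    }
}
</code></pre>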
How about using a distributed key-value store like Tokyo Cabinet, Redis, or memcached?
I still have to load the data into the key/value store. It’s too big to fit in RAM, at least on hardware that I can afford. Tokyo Cabinet is an option, but I haven’t had time to explore it much yet.
It’s a bit wacky, but have you considered doing a last-pass total sort (so that the output is efficiently sharded) and using a CSV storage engine? MySQL has one, and so do others, I believe.
If it’s the merge that’s the problem, does just sharding help — doing a final sort so each key knows which reducer’s (shard’s) index to query on? (or querying against all in parallel?)
Philip: That’s an interesting idea. But as far as I can tell from the docs, the CSV storage engine can’t retrieve a single record without a full-table scan.
A similar approach that I want to try is making the final Hadoop job output a MapFile, which is basically a SequenceFile with an index. Then I could just dump the MapFiles on a filesystem and retrieve the records from there. But I haven’t found much discussion of MapFiles outside of the API docs, so I don’t know whether they are really suitable for this kind of use.
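A minimal sketch of what the lookup side of that idea might look like, again assuming Text keys and values (the real types would match whatever the Hadoop job writes). MapFile.Reader loads the small index file into memory and then seeks directly to the record in the data file, which is exactly the random-access behavior a plain SequenceFile lacks.

<pre><code>import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
    public static Text lookup(String mapFileDir, String docid) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // A MapFile is a directory containing a sorted "data" file
        // plus an "index" file of key offsets.
        MapFile.Reader reader = new MapFile.Reader(fs, mapFileDir, conf);
        try {
            Text value = new Text();
            // Returns the value for the key, or null if it is not present.
            return (Text) reader.get(new Text(docid), value);
        } finally {
            reader.close();
        }
    }
}
</code></pre>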