Apathy of the Commons

Eight years ago, I filed a bug on an open-source project. HADOOP-3733 appeared to be a minor problem with special characters in URLs. I hadn’t bothered to examine the source code, but I assumed it would be an easy fix. Who knows, maybe it would even give some eager young programmer the opportunity to make…

Clojure-Hadoop 1.0.0

At long last, I have made a formal release of my clojure-hadoop library. Downloads and more information are here. The 1.0.0 release is documented, though not in exhaustive detail. Other people have used it successfully, but it may not support every possible Hadoop configuration. Watch the video of my presentation at HadoopWorld NYC.

Hadoop Meetup 2/10

Just a little self-promotion: I’ll be presenting at the New York Hadoop User Group on Tuesday, February 10, at 6:30. I’ll talk about how I use Hadoop for AltLaw.org, including citation linking, distributed indexing, and Clojure integration. Update 2/28: My slides from this presentation are available in the Meetup group files.

The Document-Blob Model

Update September 22, 2008: I have abandoned this model. I’m still using Hadoop, but with a much simpler data model. I’ll post about it at some point. … Gosh darn, it’s hard to get this right. In my most recent work on AltLaw, I’ve been building an infrastructure for doing all my back-end data processing…

A Million Little Files

My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain enough information…
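One standard remedy is to pack the small files into a single Hadoop SequenceFile, with each file’s name as the key and its raw bytes as the value, so Hadoop can stream the whole batch sequentially. Here is a minimal sketch against the classic Hadoop API; the small-files input directory and blobs.seq output path are hypothetical:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // One big, sequentially readable file instead of a million small ones.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("blobs.seq"), Text.class, BytesWritable.class);
        try {
            File[] inputs = new File("small-files").listFiles();
            if (inputs == null) return;
            for (File f : inputs) {
                byte[] bytes = Files.readAllBytes(f.toPath());
                // One record per small file: name -> contents.
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        } finally {
            writer.close();
        }
    }
}
```

On the read side, SequenceFileInputFormat hands each record to a mapper directly, with none of the per-file open-and-seek overhead that a million separate files would incur.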

Disk is the New Tape

An interesting scenario from Doug Cutting: Say you have a terabyte of data, on a disk with 10ms seek time and 100MB/s max throughput. You want to update 1% of the records. If you do it with random-access seeks, it takes 35 days to finish. On the other hand, if you scan the entire database…
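The punchline follows from arithmetic on the numbers above. A back-of-the-envelope sketch, assuming 100-byte records and three 10 ms seeks per in-place update (neither figure is given above, but together they reproduce the 35-day number):

```java
public class SeekVsScan {
    public static void main(String[] args) {
        double totalBytes  = 1e12;   // 1 TB of data
        double bytesPerSec = 100e6;  // 100 MB/s sequential throughput
        double seekSec     = 0.010;  // 10 ms per seek
        double recordBytes = 100;    // assumed record size
        double seeksPerUpd = 3;      // assumed: locate, read, write back

        // 1% of 1e10 records = 1e8 updates.
        double updates = 0.01 * (totalBytes / recordBytes);

        double randomSec = updates * seeksPerUpd * seekSec;  // 3e6 s
        double scanSec   = totalBytes / bytesPerSec;         // 1e4 s

        System.out.printf("random access: %.1f days%n", randomSec / 86400);  // ~34.7 days
        System.out.printf("full scan:     %.1f hours%n", scanSec / 3600);    // ~2.8 hours
    }
}
```

Under those assumptions the full scan finishes in under three hours, beating the seek-based update by more than two orders of magnitude: treat the disk like a tape and stream it.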