Eight years ago, I filed a bug on an open-source project. HADOOP-3733 appeared to be a minor problem with special characters in URLs. I hadn’t bothered to examine the source code, but I assumed it would be an easy fix. Who knows, maybe it would even give some eager young programmer the opportunity to make… Continue reading Apathy of the Commons
At long last, I have made a formal release of my clojure-hadoop library. Downloads and more information here. The 1.0.0 release is documented, but not in exhaustive detail. Other people have used this successfully, but it may not support all possible Hadoop configurations. Watch the video of my presentation at HadoopWorld NYC.
I haven’t posted in a while — look for more later this summer. But in the meantime, I have a question: How do you structure data such that you can efficiently manipulate it on both a large scale and a small scale at the same time? By large scale, I mean applying a transformation… Continue reading Big & Small at the same time
Just a little self-promotion: I’ll be presenting at the New York Hadoop User Group on Tuesday, February 10 at 6:30. I’ll talk about how I use Hadoop for AltLaw.org, including citation linking, distributed indexing, and using Clojure with Hadoop. Update 2/28: My slides from this presentation are available from the Meetup group files.
When storing any large collection of data, one of the most critical decisions one has to make is when to normalize and when to denormalize. Normalized data is good for flexibility — you can write queries to recombine things in any combination. Denormalized data is more efficient when you know, in advance, what the queries… Continue reading Antidenormalizationism
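To make the tradeoff concrete, here is a minimal sketch (my own illustration, not from the post) of the same data stored both ways. The table names and fields are hypothetical:

```python
# Normalized: each fact is stored exactly once; queries recombine via keys.
authors = {1: {"name": "Ada"}}
posts = [
    {"title": "Hello", "author_id": 1},
    {"title": "World", "author_id": 1},
]

def posts_with_author(posts, authors):
    """Join posts to authors at query time: flexible, but every read
    pays the cost of the lookup."""
    return [{"title": p["title"], "author": authors[p["author_id"]]["name"]}
            for p in posts]

# Denormalized: the author's name is copied into every post. Reads are a
# single lookup, but renaming Ada means rewriting every one of her posts.
denormalized_posts = [
    {"title": "Hello", "author": "Ada"},
    {"title": "World", "author": "Ada"},
]

assert posts_with_author(posts, authors) == denormalized_posts
```

The join version answers queries the denormalized version was never designed for; the denormalized version wins when the query pattern is fixed in advance.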
Update September 22, 2008: I have abandoned this model. I’m still using Hadoop, but with a much simpler data model. I’ll post about it at some point. … Gosh darn, it’s hard to get this right. In my most recent work on AltLaw, I’ve been building an infrastructure for doing all my back-end data processing… Continue reading The Document-Blob Model
I just did my first test-run of a Hadoop cluster on Amazon EC2. It’s not as tricky as it appears, although I ran into some snags, which I’ll document here. I also found these pages helpful: EC2 on Hadoop Wiki and manAmplified. First, make sure the EC2 API tools are installed and on your path.… Continue reading EC2 Authorizations for Hadoop
My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain enough information… Continue reading A Million Little Files
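The core idea — concatenate small files into one contiguous blob and keep an index of offsets, which is roughly what a tar file or a Hadoop SequenceFile does — can be sketched in a few lines. This is my own toy illustration, not Hadoop's actual file format:

```python
import io

def pack(files):
    """Concatenate many small files into one blob, recording each file's
    (offset, length) in an index so individual files stay addressable."""
    blob = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (blob.tell(), len(data))
        blob.write(data)
    return blob.getvalue(), index

def read_one(blob, index, name):
    """Random access to a single packed file: one index lookup, one slice."""
    offset, length = index[name]
    return blob[offset:offset + length]

files = {"a.txt": b"alpha", "b.txt": b"bravo"}
blob, index = pack(files)
assert read_one(blob, index, "b.txt") == b"bravo"
```

The blob can be read sequentially at full disk throughput, while the index preserves per-file access — the missing piece that keeps Hadoop from using plain tar files directly.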
An interesting scenario from Doug Cutting: Say you have a terabyte of data, on a disk with 10ms seek time and 100MB/s max throughput. You want to update 1% of the records. If you do it with random-access seeks, it takes 35 days to finish. On the other hand, if you scan the entire database… Continue reading Disk is the New Tape
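Back-of-the-envelope arithmetic makes the gap vivid. The record size and seeks-per-update below are my assumptions, chosen as one plausible set of numbers that reproduces the 35-day figure — the post itself doesn't spell them out:

```python
TB = 10**12            # database size in bytes
SEEK = 0.010           # seconds per random seek
THROUGHPUT = 100e6     # bytes/second of sequential transfer

# Assumptions (mine): 100-byte records, and roughly three seeks per
# random update (locate, read, write back).
records = TB // 100
updates = records // 100                  # 1% of all records
seek_days = updates * 3 * SEEK / 86400

# Full scan: read the whole database and write it back out.
scan_hours = 2 * TB / THROUGHPUT / 3600

print(round(seek_days))       # roughly 35 days of pure seeking
print(round(scan_hours, 1))   # about 5.6 hours for the full rewrite
```

Seek time dominates so completely that rewriting 100% of the data sequentially beats touching 1% of it at random — hence "disk is the new tape."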
As I told a friend recently, I’m pretty happy with the front-end code of AltLaw. It’s just a simple Ruby on Rails app that uses Solr for search and storage. The code is small and easy to maintain. What I’m not happy with is the back-end code, the data extraction, formatting, and indexing. It’s a… Continue reading Continuous Integration for Data