Tag: Hadoop

Apathy of the Commons

Eight years ago, I filed a bug on an open-source project. HADOOP-3733 appeared to be a minor problem with special characters in URLs. I hadn’t bothered to examine the source code, but I assumed it would be an easy fix. Who knows, maybe it would even give some eager young programmer the opportunity to make…

Read the full article

Clojure-Hadoop 1.0.0

At long last, I have made a formal release of my clojure-hadoop library. Downloads and more information here. The 1.0.0 release is documented, but not in exhaustive detail. Other people have used it successfully, but it may not support all possible Hadoop configurations. Watch the video of my presentation at HadoopWorld NYC.

Big & Small at the same time

I haven’t posted in a while; look for more later this summer. But in the meantime, I have a question: How do you structure data such that you can efficiently manipulate it on both a large scale and a small scale at the same time? By large scale, I mean applying a transformation…

Read the full article

Hadoop Meetup 2/10

Just a little self-promotion: I’ll be presenting at the New York Hadoop User Group on Tuesday, February 10 at 6:30. I’ll talk about how I use Hadoop for AltLaw.org, including citation linking, distributed indexing, and using Clojure with Hadoop. Update 2/28: My slides from this presentation are available from the Meetup group files.

Antidenormalizationism

When storing any large collection of data, one of the most critical decisions one has to make is when to normalize and when to denormalize. Normalized data is good for flexibility — you can write queries to recombine things in any combination. Denormalized data is more efficient when you know, in advance, what the queries…

Read the full article

The Document-Blob Model

Update September 22, 2008: I have abandoned this model. I’m still using Hadoop, but with a much simpler data model. I’ll post about it at some point. … Gosh darn, it’s hard to get this right. In my most recent work on AltLaw, I’ve been building an infrastructure for doing all my back-end data processing…

Read the full article

EC2 Authorizations for Hadoop

I just did my first test-run of a Hadoop cluster on Amazon EC2. It’s not as tricky as it appears, although I ran into some snags, which I’ll document here. I also found these pages helpful: EC2 on Hadoop Wiki and manAmplified. First, make sure the EC2 API tools are installed and on your path.…

Read the full article

A Million Little Files

My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain enough information…

Read the full article

Disk is the New Tape

An interesting scenario from Doug Cutting: Say you have a terabyte of data, on a disk with 10ms seek time and 100MB/s max throughput. You want to update 1% of the records. If you do it with random-access seeks, it takes 35 days to finish. On the other hand, if you scan the entire database…

Read the full article
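
The arithmetic in the excerpt above can be roughly checked with a short sketch. The excerpt gives the data size, seek time, throughput, and update fraction, but not the record size or the number of seeks per update. The values below (100-byte records, three seeks per update) are assumptions chosen only because they are consistent with the 35-day figure; they are not stated in the excerpt.

```python
# Back-of-the-envelope comparison of random-access updates vs. a full scan.
# From the excerpt: 1 TB of data, 10 ms seek time, 100 MB/s throughput, 1% updated.
# ASSUMPTIONS (not in the excerpt): 100-byte records, 3 seeks per updated record.

DATA_BYTES = 10**12            # 1 terabyte
SEEK_SECONDS = 0.010           # 10 ms per seek
THROUGHPUT_BPS = 100 * 10**6   # 100 MB/s sequential transfer
UPDATE_PERCENT = 1             # update 1% of the records

RECORD_BYTES = 100             # assumed record size
SEEKS_PER_UPDATE = 3           # assumed seeks needed to update one record


def random_access_seconds():
    """Time to update the chosen records one at a time, paying for every seek."""
    total_records = DATA_BYTES // RECORD_BYTES
    updated_records = total_records * UPDATE_PERCENT // 100
    return updated_records * SEEKS_PER_UPDATE * SEEK_SECONDS


def full_scan_seconds():
    """Time to stream the whole data set once at full sequential throughput."""
    return DATA_BYTES / THROUGHPUT_BPS


random_days = random_access_seconds() / 86400
scan_hours = full_scan_seconds() / 3600
rewrite_hours = 2 * scan_hours  # one pass to read plus one pass to write

print(f"random-access updates: about {random_days:.1f} days")
print(f"one sequential pass:   about {scan_hours:.1f} hours")
print(f"read plus rewrite:     about {rewrite_hours:.1f} hours")
```

Under these assumptions the random-access approach takes about 35 days, while streaming through the entire data set takes only a few hours, even when every byte is both read and written back.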

Continuous Integration for Data

As I told a friend recently, I’m pretty happy with the front-end code of AltLaw. It’s just a simple Ruby on Rails app that uses Solr for search and storage. The code is small and easy to maintain. What I’m not happy with is the back-end code, the data extraction, formatting, and indexing. It’s a…

Read the full article