Tag Archives: Hadoop

Clojure-Hadoop 1.0.0

At long last, I have made a formal release of my clojure-hadoop library. Downloads and more information here. The 1.0.0 release is documented, but not in exhaustive detail. Other people have used this successfully, but it may not support all … Continue reading

Posted in Programming

Big & Small at the same time

I haven’t posted in a while; look for more later this summer. But in the meantime, I have a question: How do you structure data such that you can efficiently manipulate it on both a large scale and … Continue reading

Posted in Programming | 4 Comments

Hadoop Meetup 2/10

Just a little self-promotion: I’ll be presenting at the New York Hadoop User Group on Tuesday, February 10 at 6:30. I’ll talk about how I use Hadoop for AltLaw.org, including citation linking, distributed indexing, and using Clojure with Hadoop. Update … Continue reading

Posted in Programming | 4 Comments

Antidenormalizationism

When storing any large collection of data, one of the most critical decisions one has to make is when to normalize and when to denormalize.  Normalized data is good for flexibility — you can write queries to recombine things in … Continue reading

Posted in Programming | 3 Comments

The Document-Blob Model

Update September 22, 2008: I have abandoned this model.  I’m still using Hadoop, but with a much simpler data model.  I’ll post about it at some point. … Gosh darn, it’s hard to get this right.  In my most recent … Continue reading

Posted in Programming | 3 Comments

EC2 Authorizations for Hadoop

I just did my first test-run of a Hadoop cluster on Amazon EC2. It’s not as tricky as it appears, although I ran into some snags, which I’ll document here. I also found these pages helpful: EC2 on Hadoop Wiki … Continue reading

Posted in Programming | 1 Comment

A Million Little Files

My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on … Continue reading
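As a rough illustration of the "many small files versus one big file" trade-off, here is a minimal Python sketch that packs a pile of tiny records into a single tar archive with the standard `tarfile` module and reads them back in archive order. The file names, the sample data, and the `pack_records` / `iter_records` helpers are all hypothetical, made up for this example; it is one possible way to bundle small files, not necessarily the approach the post goes on to describe.

```python
import io
import tarfile


def pack_records(records):
    """Bundle a dict of {name: bytes} into one in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in records.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # the tar header must declare the payload size
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def iter_records(archive_bytes):
    """Yield (name, bytes) pairs, reading members in archive order."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()


# Hypothetical sample data: 1,000 tiny "files" held in memory.
small_files = {f"doc-{i:04d}.txt": f"record {i}\n".encode() for i in range(1000)}
archive = pack_records(small_files)
restored = dict(iter_records(archive))
```

The point of the sketch is only that a thousand small files become one blob that can be written and scanned front to back in a single pass.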

Posted in Programming | 20 Comments

Disk is the New Tape

An interesting scenario from Doug Cutting: Say you have a terabyte of data, on a disk with 10ms seek time and 100MB/s max throughput. You want to update 1% of the records. If you do it with random-access seeks, it … Continue reading
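The arithmetic behind a scenario like this can be sketched in a few lines of Python. The disk figures (1 TB, 10 ms seek, 100 MB/s) come from the excerpt above; the 100-byte record size and the one-seek-per-update cost are assumptions of mine, since the excerpt does not state them, and the function names are hypothetical. Treat the output as a back-of-the-envelope estimate under those assumptions, not as the figures from the original scenario.

```python
def random_update_seconds(total_bytes, record_bytes, update_percent,
                          seek_ms, seeks_per_update=1):
    """Estimate time spent seeking when updating records one at a time."""
    records = total_bytes // record_bytes
    updates = records * update_percent // 100
    return updates * seeks_per_update * seek_ms / 1000


def full_rewrite_seconds(total_bytes, throughput_bytes_per_s):
    """Estimate time to stream the whole data set: read it once, write it once."""
    return 2 * total_bytes / throughput_bytes_per_s


TERABYTE = 10**12            # from the excerpt: a terabyte of data
THROUGHPUT = 100 * 10**6     # from the excerpt: 100 MB/s
SEEK_MS = 10                 # from the excerpt: 10 ms seek time
RECORD_BYTES = 100           # ASSUMPTION: record size is not given in the excerpt

seek_time = random_update_seconds(TERABYTE, RECORD_BYTES, 1, SEEK_MS)
stream_time = full_rewrite_seconds(TERABYTE, THROUGHPUT)

print(f"random-access updates: {seek_time / 86400:.1f} days")
print(f"full sequential rewrite: {stream_time / 3600:.1f} hours")
```

Under these assumptions the seek-based approach comes to roughly 11.6 days against roughly 5.6 hours for rewriting everything sequentially. Raising `seeks_per_update` (for example, to count a separate seek for the read and the write) only widens the gap.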

Posted in Programming | 4 Comments

Continuous Integration for Data

As I told a friend recently, I’m pretty happy with the front-end code of AltLaw.  It’s just a simple Ruby on Rails app that uses Solr for search and storage.  The code is small and easy to maintain. What I’m … Continue reading

Posted in Programming