Tag Archives: Hadoop
Clojure-Hadoop 1.0.0
At long last, I have made a formal release of my clojure-hadoop library. Downloads and more information here. The 1.0.0 release is documented, but not in exhaustive detail. Other people have used this successfully, but it may not support all … Continue reading
Big & Small at the same time
I haven’t posted in a while; look for more later this summer. But in the meantime, I have a question: How do you structure data such that you can efficiently manipulate it on both a large scale and … Continue reading
Hadoop Meetup 2/10
Just a little self-promotion: I’ll be presenting at the New York Hadoop User Group on Tuesday, February 10 at 6:30. I’ll talk about how I use Hadoop for AltLaw.org, including citation linking, distributed indexing, and using Clojure with Hadoop. Update … Continue reading
Antidenormalizationism
When storing any large collection of data, one of the most critical decisions one has to make is when to normalize and when to denormalize. Normalized data is good for flexibility — you can write queries to recombine things in … Continue reading
The Document-Blob Model
Update September 22, 2008: I have abandoned this model. I’m still using Hadoop, but with a much simpler data model. I’ll post about it at some point. … Gosh darn, it’s hard to get this right. In my most recent … Continue reading
EC2 Authorizations for Hadoop
I just did my first test-run of a Hadoop cluster on Amazon EC2. It’s not as tricky as it appears, although I ran into some snags, which I’ll document here. I also found these pages helpful: EC2 on Hadoop Wiki … Continue reading
A Million Little Files
My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on … Continue reading
Disk is the New Tape
An interesting scenario from Doug Cutting: Say you have a terabyte of data, on a disk with 10ms seek time and 100MB/s max throughput. You want to update 1% of the records. If you do it with random-access seeks, it … Continue reading
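The arithmetic behind this scenario can be roughed out as follows. This is only a sketch: the excerpt gives the data size, seek time, throughput, and update fraction, but not the record size or the number of seeks per update. The 100-byte record size and the one-seek-per-update cost below are assumptions made for illustration, so the resulting figures may differ from those in the full post.

```python
# Back-of-the-envelope comparison: random-access updates vs. rewriting everything.
# From the scenario: 1 TB of data, 10 ms seek time, 100 MB/s throughput, 1% updated.
# ASSUMPTIONS (not stated in the excerpt): 100-byte records, and each
# random-access update costs roughly one seek (transfer time ignored).

TOTAL_BYTES = 10**12        # 1 TB
SEEK_SECONDS = 0.010        # 10 ms per seek
THROUGHPUT = 100 * 10**6    # 100 MB/s, in bytes per second
UPDATE_PERCENT = 1          # update 1% of the records
RECORD_BYTES = 100          # hypothetical record size


def random_access_seconds():
    """Time to update the chosen records one seek at a time."""
    records = TOTAL_BYTES // RECORD_BYTES
    updates = records * UPDATE_PERCENT // 100
    return updates * SEEK_SECONDS


def full_rewrite_seconds():
    """Time to stream the whole data set: one read pass plus one write pass."""
    return 2 * TOTAL_BYTES / THROUGHPUT


print(f"random access: {random_access_seconds() / 86400:.1f} days")
print(f"full rewrite:  {full_rewrite_seconds() / 3600:.1f} hours")
```

Under these assumptions the seek-based approach takes on the order of days, while streaming through the entire terabyte takes a few hours. Smaller records or more seeks per update widen the gap further.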
Continuous Integration for Data
As I told a friend recently, I’m pretty happy with the front-end code of AltLaw. It’s just a simple Ruby on Rails app that uses Solr for search and storage. The code is small and easy to maintain. What I’m … Continue reading