Archive for April 17th, 2008

An interesting scenario from Doug Cutting: Say you have a terabyte of data on a disk with a 10 ms seek time and 100 MB/s maximum throughput, and you want to update 1% of the records. If you do it with random-access seeks, the update takes 35 days to finish. If you instead scan the entire database sequentially and write it back out again, it takes 5.6 hours.
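Those figures can be reproduced with some back-of-the-envelope arithmetic. The scenario as quoted does not give a record size or a number of seeks per update, so the sketch below assumes 100-byte records and three seeks per update; both values are assumptions chosen because they reproduce the quoted numbers.

```python
# Back-of-the-envelope check of the random-access vs. sequential figures.
# ASSUMPTIONS (not stated in the scenario): 100-byte records, 3 seeks per update.

DATA_BYTES = 10**12            # one terabyte
RECORD_BYTES = 100             # assumed record size
SEEK_SECONDS = 0.010           # 10 ms seek time
THROUGHPUT = 100 * 10**6       # 100 MB/s, in bytes per second
SEEKS_PER_UPDATE = 3           # assumed seeks needed to update one record

records = DATA_BYTES // RECORD_BYTES   # total number of records
updates = records // 100               # 1% of the records

# Random access: every updated record pays for its seeks.
random_days = updates * SEEKS_PER_UPDATE * SEEK_SECONDS / 86400

# Sequential: read the whole terabyte once, then write it all back out.
sequential_hours = 2 * DATA_BYTES / THROUGHPUT / 3600

print(f"random access: {random_days:.1f} days")
print(f"sequential:    {sequential_hours:.1f} hours")
```

With a larger record size or fewer seeks per update the gap narrows, but under these assumptions the sequential rewrite wins by two orders of magnitude.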

This is why Hadoop only supports linear access to the filesystem.  It’s also why Hadoop coder Tom White says disks have become tapes.  All this is contrary to the way I think about data, which is to build hash tables and indexes for fast random-access lookups.  I’m still trying to get my head around this concept of “linear” data processing.  But I have found that I can do some things faster by reading sequentially through a batch of files than by stuffing everything into a database (RDF or SQL) and doing big join queries.


As I told a friend recently, I’m pretty happy with the front-end code of AltLaw.  It’s just a simple Ruby on Rails app that uses Solr for search and storage.  The code is small and easy to maintain.

What I’m not happy with is the back-end code: the data extraction, formatting, and indexing.  It’s a hodge-podge of Ruby, Perl, Clojure, C, shell scripts, SQL, XML, RDF, and text files that could make the most dedicated Unix hacker blanch.  It works, but just barely, and I panic every time I think about implementing a new feature.

This is a software engineering problem rather than a pure computer science problem, and I never pretended to be a software engineer.  (I never pretended to be a computer scientist, either.)  It might also be a problem for a larger team than an army of one (plus a few volunteers).

But given that I can get more processing power (via Amazon) more easily than I can get more programmers, how can I make use of the resources I have to enhance my own productivity?

I’m studying Hadoop and Cascading in the hopes that they will help.  But those systems are inherently batch-oriented.  I’d like to move away from a batch processing model if I can.  Given that AltLaw acquires 50 to 100 new cases per day, adding to a growing database of over half a million, what I would really like to have is a kind of “continuous integration” process for data.  I want a server that runs continuously, accepting new data and new code and automatically re-running processes as needed to keep up with dependencies.  Perhaps given a year of free time I could invent this myself, but I’m too busy debugging my shell scripts.
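To make the "continuous integration for data" idea a little more concrete, here is a minimal sketch of its core: given a graph of processing steps, work out which steps have to be re-run when some step's code or input changes. It is not a design for the server itself. The step names and the `steps_to_rerun` helper are hypothetical (the `citations` step in particular is invented for illustration), and the only library used is Python's standard `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it depends on.
DEPENDS_ON = {
    "extract": set(),
    "format": {"extract"},
    "index": {"format"},
    "citations": {"extract"},   # invented step, for illustration only
}

def steps_to_rerun(changed, depends_on=DEPENDS_ON):
    """Return the changed steps plus everything downstream of them,
    listed in an order that respects the dependencies."""
    dirty = set(changed)
    # static_order() yields every step after all of its dependencies.
    order = list(TopologicalSorter(depends_on).static_order())
    for step in order:
        # A step is dirty if any step it depends on is dirty.
        if depends_on[step] & dirty:
            dirty.add(step)
    return [step for step in order if step in dirty]

# New formatting code means re-formatting and re-indexing, but not re-extracting.
print(steps_to_rerun({"format"}))
```

A real system would also need to track which input files changed, persist intermediate results, and run continuously; the sketch covers only the dependency bookkeeping.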
