Author: Stuart

We Don’t Know How We Program

Paul Johnson, in the U.K., wrote a piece about how there is no known “process” for programming. At some point, all the theory and methodology go out the window and someone has to sit down, think about the problem, and write some code. I’m sure I won’t be the only one to suggest this, but…

Calling Java Constructors with this()

The things I don’t know about Java… could fill a book. Here’s a new one, from the Hadoop sources:

    public ArrayWritable(Class valueClass) {
        // …
    }

    public ArrayWritable(Class valueClass, Writable[] values) {
        this(valueClass);
        this.values = values;
    }

The second constructor uses the syntax this(arg) to call a different constructor, then follows with initialization code of…
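For anyone else who hasn't run into this, here is a minimal, self-contained sketch of the same pattern. The Batch class and its fields are made up for illustration and have nothing to do with Hadoop's actual ArrayWritable. The one rule worth knowing is that the this(...) call has to be the first statement in the constructor.

```java
// Hypothetical class, invented for illustration; not part of Hadoop.
class Batch {
    private final String kind;          // set only by the one-argument constructor
    private int[] values = new int[0];  // default used when no values are given

    // The "base" constructor: does the shared initialization.
    Batch(String kind) {
        this.kind = kind;
    }

    // Delegates to the constructor above with this(...), which must be the
    // first statement, then runs its own extra initialization.
    Batch(String kind, int[] values) {
        this(kind);
        this.values = values;
    }

    String kind() {
        return kind;
    }

    int size() {
        return values.length;
    }
}
```

Calling new Batch("ints", new int[] {1, 2, 3}) runs the one-argument constructor first and then assigns the array, so the shared setup lives in one place.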

Astronauts Without Mission Control

Joel Spolsky complains that architecture astronauts are taking over at big, rich companies like Google and Microsoft, pushing out elaborate architectural systems that don’t solve actual problems. He’s right in that smart, technical people like to take on any large, abstract problem that is, as he puts it, “a fun programming exercise that you’re doing…

A Million Little Files

My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain enough information…

The Great Database Rewrite

I just discovered the paper The End of an Architectural Era (It’s Time for a Complete Rewrite), about re-designing database software from the ground up. It contains some unsurprising predictions — “the next decade will bring domination by shared-nothing computer systems, often called grid computing” — and some interesting ideas: Any database smaller than 1…

Power At Your Fingertips

I just ran my first Amazon EC2 instance. Kind of a heady feeling, having nearly unlimited computing power just a few keystrokes away. I got the same feeling the first time I logged in as root on a dedicated web server. I gotta say, though, that the ticking meter — even at just $0.10/hour —…

There Is No Database

I think I’m starting to get a handle on how Hadoop is supposed to work. The MapReduce model isn’t what troubles me. The mind-bending part is that there is no database. Everything happens by scanning big files from beginning to end. It’s like everything I learned about data structures with O(log n) access no longer…

Disk is the New Tape

An interesting scenario from Doug Cutting: Say you have a terabyte of data, on a disk with 10ms seek time and 100MB/s max throughput. You want to update 1% of the records. If you do it with random-access seeks, it takes 35 days to finish. On the other hand, if you scan the entire database…
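The arithmetic can be sketched in a few lines of Java. The terabyte, the 10ms seek time, and the 100MB/s throughput come from the scenario above; the record size (100 bytes) and the number of seeks per update (3) are my own assumptions, chosen because they land close to the 35-day figure. The full-scan estimate likewise assumes you read the whole terabyte once and write it back once.

```java
// Back-of-the-envelope estimate; RECORD_BYTES and SEEKS_PER_UPDATE are assumptions.
class DiskUpdateEstimate {
    static final long TOTAL_BYTES = 1_000_000_000_000L;         // 1 TB (from the scenario)
    static final long SEEK_MILLIS = 10;                         // 10ms seek (from the scenario)
    static final long THROUGHPUT_BYTES_PER_SEC = 100_000_000L;  // 100MB/s (from the scenario)
    static final long RECORD_BYTES = 100;                       // assumed record size
    static final long SEEKS_PER_UPDATE = 3;                     // assumed seeks per updated record

    // Time to update 1% of the records using one set of seeks per record.
    static double randomAccessDays() {
        long records = TOTAL_BYTES / RECORD_BYTES;
        long updated = records / 100; // 1% of the records
        long totalMillis = updated * SEEKS_PER_UPDATE * SEEK_MILLIS;
        return totalMillis / 1000.0 / 86_400.0;
    }

    // Time to read the whole data set sequentially and write it back once.
    static double fullScanHours() {
        long seconds = 2 * (TOTAL_BYTES / THROUGHPUT_BYTES_PER_SEC);
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        System.out.printf("random access: %.1f days%n", randomAccessDays());
        System.out.printf("full scan:     %.1f hours%n", fullScanHours());
    }
}
```

Under those assumptions the random-access route comes to about 34.7 days, while reading and rewriting everything takes about 5.6 hours.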

Continuous Integration for Data

As I told a friend recently, I’m pretty happy with the front-end code of AltLaw. It’s just a simple Ruby on Rails app that uses Solr for search and storage. The code is small and easy to maintain. What I’m not happy with is the back-end code, the data extraction, formatting, and indexing. It’s a…

Privacy, Open Access, and the Law

Since we started putting court cases on the interwebs, first with Project Posner and then with AltLaw, we’ve had the occasional angry email from someone who Googles himself/herself and finds a court case from 20 years ago that reveals embarrassing and career-damaging facts. They usually want the page taken down. Now, sometimes I’m sympathetic with…
