Archive for the 'Programming' Category

EC2 Authorizations for Hadoop

Wednesday, May 14th, 2008

I just did my first test-run of a Hadoop cluster on Amazon EC2. It’s not as tricky as it appears, although I ran into some snags, which I’ll document here. I also found these pages helpful: EC2 on Hadoop Wiki and manAmplified.
First, make sure the EC2 API tools are installed and on your [...]

Stop Your Java SAX Parser from Downloading DTDs

Thursday, May 8th, 2008

Back in February, in a slightly plaintive post, the W3C sysadmins asked that people stop hammering their servers with requests for XHTML DTDs. Everyone said yes, this is a stupid problem that wouldn’t have happened if a) the XML spec were less dumb, or b) XML libraries were less dumb.
After that post, I spent [...]
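
One widely used way to keep a Java SAX parser off the network is to install an EntityResolver that answers every external entity request, the DTD included, with empty content. The sketch below is an illustration of that general technique, not necessarily the approach the post goes on to describe; the class name NoDtdFetch and the method countElements are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
public class NoDtdFetch {

    // Parses the given XML and returns the number of start tags seen,
    // without fetching any external DTD.
    static int countElements(String xml) {
        final int[] count = {0};
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();

            // Resolve every external entity (including the DOCTYPE's DTD)
            // to an empty document instead of opening a URL connection.
            reader.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));

            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    count[0]++;
                }
            });

            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return count[0];
    }
}
```

One caveat with this approach: because the DTD is replaced by nothing, named entities that the real DTD would declare (such as &nbsp; in XHTML) are no longer defined, so documents that use them may fail to parse.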

We Don’t Know How We Program

Thursday, May 8th, 2008

Paul Johnson, in the U.K., wrote a piece about how there is no known “process” for programming. At some point, all the theory and methodology go out the window and someone has to sit down, think about the problem, and write some code.
I’m sure I won’t be the only one to suggest this, but I [...]

Calling Java Constructors with this()

Monday, May 5th, 2008

The things I don’t know about Java… could fill a book. Here’s a new one, from the Hadoop sources:

public ArrayWritable(Class valueClass) {
    // …
}

public ArrayWritable(Class valueClass, Writable[] values) {
    this(valueClass);
    this.values = values;
}

The second constructor uses the syntax this(arg) to call a different constructor, then follows with initialization code [...]
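
Here is a small self-contained sketch of the same pattern, so it can be compiled and run without the Hadoop sources. The class Box and its members are hypothetical and exist only for illustration.

```java
// Hypothetical class showing one constructor delegating to another.
public class Box {
    private final String label;
    private int[] values = new int[0];

    public Box(String label) {
        this.label = label;
    }

    public Box(String label, int[] values) {
        this(label);          // run the one-argument constructor first
        this.values = values; // then do the extra initialization
    }

    public String getLabel() {
        return label;
    }

    public int size() {
        return values.length;
    }
}
```

Calling new Box("a", new int[] {1, 2}) sets the label through the first constructor and the array through the second. The call to this(...) has traditionally had to be the first statement in the constructor body, which is why the extra initialization follows it.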

Astronauts Without Mission Control

Thursday, May 1st, 2008

Joel Spolsky complains that architecture astronauts are taking over at big, rich companies like Google and Microsoft, pushing out elaborate architectural systems that don’t solve actual problems.
He’s right in that smart, technical people like to take on any large, abstract problem that is, as he puts it, “a fun programming exercise that you’re doing because [...]

A Million Little Files

Thursday, April 24th, 2008

My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain [...]

The Great Database Rewrite

Wednesday, April 23rd, 2008

I just discovered the paper The End of an Architectural Era (It’s Time for a Complete Rewrite), about re-designing database software from the ground up.  It contains some unsurprising predictions — “the next decade will bring domination by shared-nothing computer systems, often called grid computing” — and some interesting ideas:

Any database smaller than 1 TB [...]

Power At Your Fingertips

Tuesday, April 22nd, 2008

I just ran my first Amazon EC2 instance.  Kind of a heady feeling, having nearly unlimited computing power just a few keystrokes away.  I got the same feeling the first time I logged in as root on a dedicated web server.
I gotta say, though, that the ticking meter — even at just $0.10/hour — will [...]

There Is No Database

Monday, April 21st, 2008

I think I’m starting to get a handle on how Hadoop is supposed to work. The MapReduce model isn’t what troubles me.  The mind-bending part is that there is no database. Everything happens by scanning big files from beginning to end. It’s like everything I learned about data structures with O(log n) access no [...]

Disk is the New Tape

Thursday, April 17th, 2008

An interesting scenario from Doug Cutting: Say you have a terabyte of data, on a disk with 10ms seek time and 100MB/s max throughput. You want to update 1% of the records. If you do it with random-access seeks, it takes 35 days to finish. On the other hand, if you scan [...]
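
The arithmetic behind numbers like these can be sketched in a few lines. The excerpt gives the data size, seek time, throughput, and the 1% figure; the record size (100 bytes) and the number of seeks per update (3) are assumptions added here, chosen because they land close to the 35-day figure. The class name SeekVersusScan is hypothetical.

```java
// Back-of-the-envelope comparison of random-access updates versus a full
// sequential rewrite. Values marked "assumption" are not from the excerpt.
public class SeekVersusScan {
    static final double DATA_BYTES = 1e12;           // 1 TB of data
    static final double SEEK_SECONDS = 0.010;        // 10 ms per seek
    static final double BYTES_PER_SECOND = 100e6;    // 100 MB/s throughput
    static final double UPDATE_FRACTION = 0.01;      // update 1% of records
    static final double RECORD_BYTES = 100;          // assumption
    static final double SEEKS_PER_UPDATE = 3;        // assumption

    // Time to update records one at a time, paying for seeks on each.
    static double randomAccessDays() {
        double updates = UPDATE_FRACTION * DATA_BYTES / RECORD_BYTES;
        double seconds = updates * SEEKS_PER_UPDATE * SEEK_SECONDS;
        return seconds / 86400.0;
    }

    // Time to read the whole data set once and write it back once.
    static double fullRewriteHours() {
        double seconds = 2 * DATA_BYTES / BYTES_PER_SECOND;
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        System.out.printf("random access: %.1f days%n", randomAccessDays());
        System.out.printf("full rewrite:  %.1f hours%n", fullRewriteHours());
    }
}
```

Under these assumptions the seek-bound approach takes about 34.7 days, while reading and rewriting the entire terabyte sequentially takes under 6 hours. Changing the assumed record size or seeks per update moves the first number proportionally.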