EC2 Authorizations for Hadoop

May 14th, 2008

I just did my first test-run of a Hadoop cluster on Amazon EC2. It’s not as tricky as it appears, although I ran into some snags, which I’ll document here. I also found these pages helpful: EC2 on Hadoop Wiki and manAmplified.

First, make sure the EC2 API tools are installed and on your path. Also make sure the EC2 environment variables are set. I added the following to my ~/.bashrc:

export EC2_HOME=$HOME/ec2-api-tools-1.3-19403
export EC2_PRIVATE_KEY=$HOME/.ec2/MY_PRIVATE_KEY_FILE
export EC2_CERT=$HOME/.ec2/MY_CERT_FILE
export PATH=$PATH:$EC2_HOME/bin

I also copied my generated SSH key to ~/.ec2/id_rsa-MY_KEY_NAME.

You need authorizations for the EC2 security group that Hadoop uses. The scripts in hadoop-*/src/contrib/ec2 are supposed to do this for you, but they didn’t for me. I had to do:

ec2-add-group hadoop-cluster-group -d "Group for Hadoop clusters."
ec2-authorize hadoop-cluster-group -p 22
ec2-authorize hadoop-cluster-group -o hadoop-cluster-group -u YOUR_AWS_ACCOUNT_ID
ec2-authorize hadoop-cluster-group -p 50030
ec2-authorize hadoop-cluster-group -p 50060

The first line creates the security group. The second line lets you SSH into it. The third line lets the individual nodes in the cluster communicate with one another. The fourth and fifth lines are optional; they let you monitor your MapReduce jobs through Hadoop’s web interface. (If you have a fixed IP address, you can be slightly more secure by adding -s YOUR_ADDRESS to the commands above.)

These authorizations are permanently tied to your AWS account, not to any particular group of instances, so you only need to do this once. You can see your current EC2 authorization settings with ec2-describe-group, it should look something like this:

GROUP   YOUR_AWS_ID    hadoop-cluster-group    Group for Hadoop clusters.
PERMISSION      YOUR_AWS_ID    hadoop-cluster-group    ALLOWS  all                     FROM    USER    YOUR_AWS_ID    GRPNAME hadoop-cluster-group
PERMISSION      YOUR_AWS_ID    hadoop-cluster-group    ALLOWS  tcp     22      22      FROM    CIDR    0.0.0.0/0

With additional lines for ports 50030 and 50060, if you enabled those.

Stop Your Java SAX Parser from Downloading DTDs

May 8th, 2008

Back in February, in a slightly plaintive post, the W3 sysadmins asked that people stop hammering their servers with requests for XHTML DTDs. Everyone said yes, this is a stupid problem that wouldn’t have happened if a) the XML spec were less dumb, or b) XML libraries were less dumb.

After that post, I spent two whole days fighting with XML catalogs — possibly the worst-documented XML spec ever — to make sure my Java code wasn’t downloading a DTD every time it read an XHTML document.

To my annoyance, no one seems to have posted any cut-and-paste solutions to this problem. Setting properties on the SAX parser is no help, and the XML catalogs solution is a pain to set up.

So what if someone wrote a “dummy” XML entity resolver that does nothing? Here’s what I came up with:

public class DummyEntityResolver implements EntityResolver {
    public InputSource resolveEntity(String publicID, String systemID)
        throws SAXException {

        return new InputSource(new StringReader(""));
    }
}

Lo and behold, it works! The key is the return line — if you return null, the SAX parser reverts to its default behavior and downloads the DTD.

Use it like this:

XMLReader reader = XMLReaderFactory.createXMLReader();
reader.setEntityResolver(new DummyEntityResolver());
reader.setContentHandler(new YourContentHandler());
reader.parse(your_xml_source);

The catch is that this will break any externally-defined entities, including standard XHTML entities like ©. The built-in XML entities such as &, and numeric character entities like &x43;, will still work.

You can check that you’re not downloading any DTD’s by watching the output of ngrep -q DTD while running your XML parser. If it doesn’t print anything, you’re good.

We Don’t Know How We Program

May 8th, 2008

Paul Johnson, in the U.K., wrote a piece about how there is no known “process” for programming.  At some point, all the theory and methodology goes out the window and someone has to sit down, think about the problem, and write some code.

I’m sure I won’t be the only one to suggest this, but I like to think of programming as analogous to writing prose.  You have an idea, a concept, something nebulous in your head, and you have to express it in words.  A good program has both structure and flow, just as good writing does.

In one sense, the languages we program in are far less expressive than any human language, but seen in a mathematical light they are more expressive.  The code for, say, Euclid’s algorithm is much shorter than the English description of what it does, no matter how verbose your statically-typed object-oriented programming language may be.

Calling Java Constructors with this()

May 5th, 2008

The things I don’t know about Java… could fill a book. Here’s a new one, from the Hadoop sources:

public ArrayWritable(Class valueClass) {
    // ...
}

public ArrayWritable(Class valueClass, Writable[] values) {
  this(valueClass);
  this.values = values;
}

The second constructor uses the syntax this(arg) to call a different constructor, then follows with initialization code of its own. I had no idea you could do that.

Astronauts Without Mission Control

May 1st, 2008

Joel Spolsky complains that architecture astronauts are taking over at big, rich companies like Google and Microsoft, pushing out elaborate architectural systems that don’t solve actual problems.

He’s right in that smart, technical people like to take on any large, abstract problem that is, as he puts it, “a fun programming exercise that you’re doing because it’s just hard enough to be interesting but not so hard that you can’t figure it out.”

But I think the constant reinvention of (to use Spolsky’s examples) Lotus Notes or file synchronization represents a failure of leadership more than a failure of the astronauts themselves. If you don’t give smart engineers something interesting to work on, they’ll invent something themselves. But all they know is what they studied in school — abstract architectures that don’t solve actual problems — so that’s what they invent. It takes a rare, creative individual to come up with an idea — spreadsheets, Napster, eBay — that is both an interesting technical challenge and a desirable product.

So my question to Mr. Spolsky is: what should all those architecture astronauts be working on instead?

A Million Little Files

April 24th, 2008

My PC-oriented brain says it’s easier to work with a million small files than one gigantic file. Hadoop says the opposite — big files are stored contiguously on disk, so they can be read/written efficiently. UNIX tar files work on the same principle, but Hadoop can’t read them directly because they don’t contain enough information for efficient splitting. So, I wrote a program to convert tar files into Hadoop sequence files.

Here’s some code (Apache license), including all the Apache jars needed to make it work:

tar-to-seq.tar.gz (6.1 MB)

Unpack it and run:

java -jar tar-to-seq.jar tar-file sequence-file

The output sequence file is BLOCK-compressed, about 1.4 times the size of a bzip2-compressed tar file. Each key is the name of a file (a Hadoop “Text”), the value is the binary contents of the file (a BytesWritable).

It took about an hour and a half to convert a 615MB tar.bz2 file to an 868MB sequence file. That’s slow, but it only has to be done once.

The Great Database Rewrite

April 23rd, 2008

I just discovered the paper The End of an Architectural Era (It’s Time for a Complete Rewrite), about re-designing database software from the ground up.  It contains some unsurprising predictions — “the next decade will bring domination by shared-nothing computer systems, often called grid computing” — and some interesting ideas:

  • Any database smaller than 1 TB will fit entirely in main memory, distributed across multiple machines.
  • We should scrap SQL in favor of “modifying little languages [Ruby, Python, etc.] to include clean embeddings of DBMS access.”  (CouchDB is a good example of this.)
  • A database shouldn’t require an expert to tune and optimize it; instead it should all be automated to “produce a system with no visible knobs.”

In their implementation, H-store, they claim to run over 70,000 transactions per second on a standard benchmark on a modest server, compared to 850/second from a commercial DB tuned by an expert.  They also plan to “move from C++ to Ruby on Rails as our stored procedure language.”  (!)

Power At Your Fingertips

April 22nd, 2008

I just ran my first Amazon EC2 instance.  Kind of a heady feeling, having nearly unlimited computing power just a few keystrokes away.  I got the same feeling the first time I logged in as root on a dedicated web server.

I gotta say, though, that the ticking meter — even at just $0.10/hour — will make me think real hard about how I use it.  I guess I’ll get used to it in time.  I’ll have to, since I’m running out of disk space on my local hard drive again.

There Is No Database

April 21st, 2008

I think I’m starting to get a handle on how Hadoop is supposed to work. The MapReduce model isn’t what troubles me.  The mind-bending part is that there is no database. Everything happens by scanning big files from beginning to end. It’s like everything I learned about data structures with O(log n) access no longer applies, and I’m back to writing giant for-each loops. It’s perl -n gone mad.

I’ve been trying for months to find the most efficient database for AltLaw — SQL, Lucene, RDF, even Berkeley DB.  But it still takes hours and hours to process things.  Maybe the secret is to get rid of the databases and just mash together some giant files and throw them at Hadoop.

Disk is the New Tape

April 17th, 2008

An interesting scenario from Doug Cutting: Say you have a terabyte of data, on a disk with 10ms seek time and 100MB/s max throughput. You want to update 1% of the records. If you do it with random-access seeks, it takes 35 days to finish. On the other hand, if you scan the entire database sequentially and write it back out again, it takes 5.6 hours.

This is why Hadoop only supports linear access to the filesystem. It’s also why Hadoop coder Tom White says disks have become tapes.  All this is contrary to the way I think about data — building hash tables and indexes to do fast random-access lookups.  I’m still trying to get my head around this concept of “linear” data processing.  But I have found that I can do some things faster by reading sequentially through a batch of files rather than trying to stuff everything in a database (RDF or SQL) and doing big join queries.