EC2 Authorizations for Hadoop

I just did my first test-run of a Hadoop cluster on Amazon EC2. It’s not as tricky as it appears, although I ran into some snags, which I’ll document here. I also found these pages helpful: EC2 on Hadoop Wiki and manAmplified.

First, make sure the EC2 API tools are installed and on your path. Also make sure the EC2 environment variables are set. I added the following to my ~/.bashrc:

export EC2_HOME=$HOME/ec2-api-tools-1.3-19403
export EC2_PRIVATE_KEY=$HOME/.ec2/MY_PRIVATE_KEY_FILE
export EC2_CERT=$HOME/.ec2/MY_CERT_FILE
export PATH=$PATH:$EC2_HOME/bin
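
After reloading ~/.bashrc, it can help to confirm that the settings took effect before going further. This is a minimal sketch, assuming a POSIX-style shell; check_ec2_env is a hypothetical helper name of my own, not part of the EC2 API tools:

```shell
# Hypothetical helper: report any EC2 variable that is unset and check
# that the API tools are reachable through PATH.
check_ec2_env() {
  rc=0
  for var in EC2_HOME EC2_PRIVATE_KEY EC2_CERT; do
    # Indirect lookup of the variable named in $var.
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "$var is not set"
      rc=1
    fi
  done
  if ! command -v ec2-describe-group >/dev/null 2>&1; then
    echo "ec2-describe-group not found on PATH"
    rc=1
  fi
  return $rc
}

check_ec2_env || echo "Fix the settings above before continuing."
```

If it prints nothing, the three variables are set and the tools were found.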

I also copied my generated SSH key to ~/.ec2/id_rsa-MY_KEY_NAME.

You need authorizations for the EC2 security group that Hadoop uses. The scripts in hadoop-*/src/contrib/ec2 are supposed to do this for you, but they didn’t for me. I had to do:

ec2-add-group hadoop-cluster-group -d "Group for Hadoop clusters."
ec2-authorize hadoop-cluster-group -p 22
ec2-authorize hadoop-cluster-group -o hadoop-cluster-group -u YOUR_AWS_ACCOUNT_ID
ec2-authorize hadoop-cluster-group -p 50030
ec2-authorize hadoop-cluster-group -p 50060

The first line creates the security group. The second line lets you SSH into instances in that group. The third line lets the individual nodes in the cluster communicate with one another. The fourth and fifth lines are optional; they let you monitor your MapReduce jobs through Hadoop’s web interface. (If you have a fixed IP address, you can be slightly more secure by adding -s YOUR_ADDRESS to the commands above.)
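
As a rough sketch of the fixed-address variant, the snippet below only prints the restricted commands so you can review them before running anything. It assumes -s expects a CIDR range, so a single address is written with a /32 suffix. The address 203.0.113.5 is a documentation placeholder, and authorize_cmds is a hypothetical helper name:

```shell
# Placeholder address; replace with your own fixed IP.
MY_IP=203.0.113.5

# Hypothetical helper: print one ec2-authorize command per port,
# each limited to a single source address (assumed /32 CIDR form).
authorize_cmds() {
  for port in 22 50030 50060; do
    echo "ec2-authorize hadoop-cluster-group -p $port -s $1/32"
  done
}

authorize_cmds "$MY_IP"
```

Once the printed commands look right, piping them to sh (authorize_cmds "$MY_IP" | sh) would run them.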

These authorizations are permanently tied to your AWS account, not to any particular group of instances, so you only need to do this once. You can see your current EC2 authorization settings with ec2-describe-group; the output should look something like this:

GROUP   YOUR_AWS_ID    hadoop-cluster-group    Group for Hadoop clusters.
PERMISSION      YOUR_AWS_ID    hadoop-cluster-group    ALLOWS  all                     FROM    USER    YOUR_AWS_ID    GRPNAME hadoop-cluster-group
PERMISSION      YOUR_AWS_ID    hadoop-cluster-group    ALLOWS  tcp     22      22      FROM    CIDR    0.0.0.0/0

There will be additional lines for ports 50030 and 50060 if you enabled those.
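
To check for the expected ports without reading the listing by eye, something like the following may work. It is only a sketch, and it assumes the column layout of the sample output above (protocol in the fifth field, starting port in the sixth); missing_ports is a hypothetical helper name:

```shell
# Hypothetical helper: read ec2-describe-group output on stdin and
# report each port given as an argument that has no tcp rule.
missing_ports() {
  # Collect the starting port of every tcp PERMISSION line.
  open=$(awk '$1 == "PERMISSION" && $5 == "tcp" { print $6 }')
  for port in "$@"; do
    echo "$open" | grep -qx "$port" || echo "port $port is not authorized"
  done
}

# Usage (requires the EC2 API tools):
#   ec2-describe-group hadoop-cluster-group | missing_ports 22 50030 50060
```

No output means every port you listed has a matching rule.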

1 thought on “EC2 Authorizations for Hadoop”

  1. Thanks Stuart for posting this. It was really helpful, even when using Cloudera’s stuff =)
    I still have to do this step manually/in my script.
