I just did my first test-run of a Hadoop cluster on Amazon EC2. It’s not as tricky as it appears, although I ran into some snags, which I’ll document here. I also found these pages helpful: EC2 on Hadoop Wiki and manAmplified.
First, make sure the EC2 API tools are installed and on your path. Also make sure the EC2 environment variables are set. I added the following to my ~/.bashrc:
    export EC2_HOME=$HOME/ec2-api-tools-1.3-19403
    export EC2_PRIVATE_KEY=$HOME/.ec2/MY_PRIVATE_KEY_FILE
    export EC2_CERT=$HOME/.ec2/MY_CERT_FILE
    export PATH=$PATH:$EC2_HOME/bin
I also copied my generated SSH key to ~/.ec2/id_rsa-MY_KEY_NAME.
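Before going further, it can save time to confirm that this setup actually took effect in your current shell. The following is only a rough sketch: check_ec2_env is a made-up helper name (it is not part of the EC2 API tools), and it assumes the three variables and the ec2-describe-group command described in this post.

```shell
#!/bin/sh
# Sanity check for the EC2 setup above.
# check_ec2_env is a hypothetical helper, not part of the EC2 API tools.
check_ec2_env() {
    status=0
    for var in EC2_HOME EC2_PRIVATE_KEY EC2_CERT; do
        # Look up the value of the variable whose name is stored in $var
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing: $var is not set"
            status=1
        elif [ ! -e "$value" ]; then
            echo "missing: $var points to $value, which does not exist"
            status=1
        fi
    done
    # The tools themselves should be reachable through PATH
    if ! command -v ec2-describe-group >/dev/null 2>&1; then
        echo "missing: ec2-describe-group is not on PATH"
        status=1
    fi
    return "$status"
}
```

Running check_ec2_env in a fresh terminal prints one line per problem and returns a non-zero status if anything is missing, so it can also guard the start of a launch script.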
You need authorizations for the EC2 security group that Hadoop uses. The scripts in hadoop-*/src/contrib/ec2 are supposed to do this for you, but they didn’t for me. I had to do:
    ec2-add-group hadoop-cluster-group -d "Group for Hadoop clusters."
    ec2-authorize hadoop-cluster-group -p 22
    ec2-authorize hadoop-cluster-group -o hadoop-cluster-group -u YOUR_AWS_ACCOUNT_ID
    ec2-authorize hadoop-cluster-group -p 50030
    ec2-authorize hadoop-cluster-group -p 50060
The first line creates the security group. The second line lets you SSH into it. The third line lets the individual nodes in the cluster communicate with one another. The fourth and fifth lines are optional; they let you monitor your MapReduce jobs through Hadoop’s web interface. (If you have a fixed IP address, you can be slightly more secure by adding -s YOUR_ADDRESS to the commands above.)
These authorizations are permanently tied to your AWS account, not to any particular group of instances, so you only need to do this once. You can see your current EC2 authorization settings with ec2-describe-group; the output should look something like this:
    GROUP YOUR_AWS_ID hadoop-cluster-group Group for Hadoop clusters.
    PERMISSION YOUR_AWS_ID hadoop-cluster-group ALLOWS all FROM USER YOUR_AWS_ID GRPNAME hadoop-cluster-group
    PERMISSION YOUR_AWS_ID hadoop-cluster-group ALLOWS tcp 22 22 FROM CIDR 0.0.0.0/0
With additional lines for ports 50030 and 50060, if you enabled those.
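If you script this step, you may want to check which ports are already authorized before running ec2-authorize again. Here is a minimal sketch that reads ec2-describe-group output on standard input. The name port_authorized is hypothetical, and the field positions are an assumption based on the sample output above, so check them against what your version of the tools prints.

```shell
#!/bin/sh
# port_authorized GROUP PORT
# Reads ec2-describe-group output on stdin and succeeds if GROUP has a
# PERMISSION line allowing TCP on exactly PORT.
# Hypothetical helper; field positions assume the sample output shown above.
port_authorized() {
    awk -v group="$1" -v port="$2" '
        $1 == "PERMISSION" && $3 == group && $5 == "tcp" &&
            $6 == port && $7 == port { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Possible use (commented out so that sourcing this file runs nothing):
# for port in 22 50030 50060; do
#     if ec2-describe-group | port_authorized hadoop-cluster-group "$port"; then
#         echo "port $port is already authorized"
#     else
#         ec2-authorize hadoop-cluster-group -p "$port"
#     fi
# done
```

The function only inspects text, so you can try it on saved ec2-describe-group output before letting a script change any real settings.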
Thanks Stuart for posting this. It was really helpful, even when using Cloudera’s stuff =)
I still have to do this step manually/in my script.