Apache Whirr provides a cloud-vendor-neutral set of libraries for provisioning services on cloud resources. Apache Whirr supports provisioning, installing, and configuring Hadoop clusters in several cloud environments. In addition to Hadoop, Apache Whirr also supports provisioning of Apache Cassandra, Apache ZooKeeper, Apache HBase, Voldemort (key-value storage), and Apache Hama clusters in cloud environments.
In this recipe, we are going to start a Hadoop cluster on the Amazon EC2 cloud using Apache Whirr and run the WordCount MapReduce sample (the Writing the WordCount MapReduce sample, bundling it and running it using standalone Hadoop recipe from Chapter 1, Getting Hadoop up and running in a Cluster) on that cluster.
The following are the steps to deploy a Hadoop cluster on Amazon EC2 cloud using Apache Whirr and to execute the WordCount MapReduce sample on the deployed cluster.
1. Run the following command from the Whirr home directory to verify your Apache Whirr installation:
>bin/whirr version
Apache Whirr 0.8.0
jclouds 1.5.0-beta.10
2. Create a directory named .whirr in your home directory and copy the conf/credentials.sample file from the Whirr directory to the newly created directory as credentials:
>mkdir ~/.whirr
>cp conf/credentials.sample ~/.whirr/credentials
3. Add your AWS access credentials to the ~/.whirr/credentials file by editing it as follows. You can retrieve your AWS Access Keys from the AWS console (http://console.aws.amazon.com) by clicking on Security Credentials in the context menu that appears when you click your AWS username in the upper-right corner of the console. A sample credentials file is provided in the resources/whirr folder of the resources for this chapter.
# Set cloud provider connection details
PROVIDER=aws-ec2
IDENTITY=<AWS Access Key ID>
CREDENTIAL=<AWS Secret Access Key>
4. Generate an RSA key pair using the following command. This key pair is not the same as your AWS key pair.
>ssh-keygen -t rsa -P ''
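If you prefer to keep a dedicated key pair for Whirr instead of reusing your default SSH key, you can generate one at a custom path and then point the whirr.private-key-file and whirr.public-key-file properties at it. The path below is just an example, not one required by Whirr:

```shell
# Generate a passphrase-less RSA key pair at a custom path (example path)
ssh-keygen -t rsa -P '' -f ~/.ssh/whirr_id_rsa

# Confirm that both the private and the public key were created
ls -l ~/.ssh/whirr_id_rsa ~/.ssh/whirr_id_rsa.pub
```

Remember that Whirr needs a passphrase-less key, which is why -P '' is passed.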
5. Create a file named hadoop.properties with the following contents. If you provided a custom name for your key pair in the preceding step, change the whirr.private-key-file and whirr.public-key-file property values to the paths of the private key and the public key you generated. A sample hadoop.properties file is provided in the resources/whirr directory of the chapter resources.
whirr.cluster-name=whirrhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.hadoop.version=1.0.2
whirr.aws-ec2-spot-price=0.08
6. Execute the following command in the Whirr home directory to launch your Hadoop cluster on EC2:
>bin/whirr launch-cluster --config hadoop.properties
7. Whirr generates a script that starts a proxy to the cluster in the ~/.whirr directory. Run it in a new terminal. It will take a few minutes for Whirr to start the cluster and to generate this script.
>cd ~/.whirr/whirrhadoopcluster/
>hadoop-proxy.sh
8. Whirr generates a hadoop-site.xml file for your cluster in the ~/.whirr/<your cluster name> directory. You can use it to issue Hadoop commands from your local machine to your Hadoop cluster on EC2. Export the path of the directory containing the generated hadoop-site.xml file to an environmental variable named HADOOP_CONF_DIR. To execute the Hadoop commands, you should add the $HADOOP_HOME/bin directory to your path or issue the commands from the $HADOOP_HOME/bin directory.
>export HADOOP_CONF_DIR=~/.whirr/whirrhadoopcluster/
>hadoop fs -ls /
9. Create a directory named wc-input-data in HDFS and upload a text data set to that directory:
>hadoop fs -mkdir wc-input-data
>hadoop fs -put sample.txt wc-input-data
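If you do not have a text data set at hand, any plain-text file will do as input for WordCount. For example, you could create a small sample.txt on your local machine before uploading it; the content below is purely illustrative:

```shell
# Create a small plain-text input file for the WordCount sample
cat > sample.txt <<'EOF'
Hadoop is a framework for distributed data processing.
Whirr provisions Hadoop clusters on cloud resources.
EOF

# Check that the file was written
wc -l sample.txt
```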
10. Run the WordCount MapReduce sample on the cluster, using the uploaded data as input:
>hadoop jar ~/workspace/HadoopBookChap10/c10-samples.jar chapter1.WordCount wc-input-data wc-out
11. List and view the contents of the output directory as follows:
>hadoop fs -ls wc-out
Found 3 items
-rw-r--r--   3 thilina supergroup      0 2012-09-05 15:40 /user/thilina/wc-out/_SUCCESS
drwxrwxrwx   - thilina supergroup      0 2012-09-05 15:39 /user/thilina/wc-out/_logs
-rw-r--r--   3 thilina supergroup  19908 2012-09-05 15:40 /user/thilina/wc-out/part-r-00000
>hadoop fs -cat wc-out/part-* | more
12. Issue the following command to shut down and destroy the cluster once you have finished your computations:
>bin/whirr destroy-cluster --config hadoop.properties
This section describes the properties we used in the hadoop.properties file.
whirr.cluster-name=whirrhadoopcluster
The preceding property provides a name for the cluster. The instances of the cluster will be tagged using this name.
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker
The preceding property specifies the number of instances to be used for each set of roles and the type of roles for the instances. In the preceding example, one EC2 small instance is used with the hadoop-jobtracker and hadoop-namenode roles, and another two EC2 small instances are used, each with the hadoop-datanode and hadoop-tasktracker roles.
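To resize the cluster, you only need to change the instance counts in this property. For example, a hypothetical layout with five worker instances (not the one used in this recipe) would look as follows:

```
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,5 hadoop-datanode+hadoop-tasktracker
```

Each comma-separated entry is a count followed by the plus-joined set of roles that every instance in that group will carry.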
whirr.provider=aws-ec2
We use the Whirr Amazon EC2 provider to provision our cluster.
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
The preceding two properties point to the paths of the private key and the public key you provide for the cluster.
whirr.hadoop.version=1.0.2
We specify a custom Hadoop version using the preceding property. By default, Whirr 0.8 provisions a Hadoop 0.20.x cluster.
whirr.aws-ec2-spot-price=0.08
The preceding property specifies a bid price for Amazon EC2 Spot Instances. Specifying this property triggers Whirr to use EC2 Spot Instances for the cluster. If the bid price is not met, Apache Whirr Spot Instance requests time out after 20 minutes. Refer to the Saving money by using Amazon EC2 Spot Instances to execute EMR job flows recipe for more details.
More details on Whirr configuration can be found at http://whirr.apache.org/docs/0.6.0/configuration-guide.html.