Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment

Apache Whirr provides a cloud-vendor-neutral set of libraries for provisioning services on cloud resources. Apache Whirr supports provisioning, installing, and configuring Hadoop clusters in several cloud environments. In addition to Hadoop, Apache Whirr also supports provisioning of Apache Cassandra, Apache ZooKeeper, Apache HBase, Voldemort (key-value storage), and Apache Hama clusters on cloud environments.

In this recipe, we are going to start a Hadoop cluster on the Amazon EC2 cloud using Apache Whirr and run the WordCount MapReduce sample (Writing the WordCount MapReduce sample, bundling it and running it using standalone Hadoop recipe from Chapter 1, Getting Hadoop up and running in a Cluster) program on that cluster.

How to do it...

The following are the steps to deploy a Hadoop cluster on the Amazon EC2 cloud using Apache Whirr and to execute the WordCount MapReduce sample on the deployed cluster.

  1. Download and unzip the Apache Whirr binary distribution from http://whirr.apache.org/.
  2. Run the following command from the extracted directory to verify your Whirr installation.
    >bin/whirr version
    Apache Whirr 0.8.0
    jclouds 1.5.0-beta.10
    
  3. Create a directory in your home directory named .whirr. Copy the conf/credentials.sample file in the Whirr directory to the newly created directory.
    >mkdir ~/.whirr
    >cp conf/credentials.sample ~/.whirr/credentials
    
  4. Add your AWS credentials to the ~/.whirr/credentials file by editing it as shown below. You can retrieve your AWS access keys from the AWS console (http://console.aws.amazon.com) by clicking your AWS username in the upper-right corner of the console and selecting Security Credentials from the menu that appears. A sample credentials file is provided in the resources/whirr folder of the resources for this chapter.
    # Set cloud provider connection details
    PROVIDER=aws-ec2 
    IDENTITY=<AWS Access Key ID>
    CREDENTIAL=<AWS Secret Access Key>
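    As a quick sanity check, you can verify that the credentials file defines all three keys the recipe relies on. This is only a sketch; the check_credentials function name is our own, not part of Whirr.

```shell
# Sketch: verify that a Whirr credentials file defines the keys this
# recipe relies on. The function name is hypothetical, not part of Whirr.
check_credentials() {
  local file="$1" key missing=0
  for key in PROVIDER IDENTITY CREDENTIAL; do
    grep -q "^${key}=" "$file" || { echo "missing: $key"; missing=1; }
  done
  return $missing
}
```

    For example: `check_credentials ~/.whirr/credentials && echo "credentials OK"`.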
  5. Generate an RSA key pair using the following command. This key pair is not the same as your AWS key pair.
    >ssh-keygen -t rsa -P ''
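    If you want to confirm that the public key file actually matches the private key before pointing Whirr at them, the following hedged sketch re-derives the public key from the private key with ssh-keygen -y and compares the two (the check_keypair name is our own):

```shell
# Sketch: check that a public key file matches its private key by
# re-deriving the public key with ssh-keygen -y and comparing the key
# type and base64 fields (the .pub file may carry an extra comment field).
check_keypair() {
  diff <(ssh-keygen -y -f "$1" | cut -d' ' -f1-2) \
       <(cut -d' ' -f1-2 "$2") >/dev/null
}
```

    For example: `check_keypair ~/.ssh/id_rsa ~/.ssh/id_rsa.pub && echo "key pair OK"`.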
    
  6. Copy the following to a file named hadoop.properties. If you provided a custom name for your key-pair in the preceding step, change the whirr.private-key-file and the whirr.public-key-file property values to the paths of the private key and the public key you generated. A sample hadoop.properties file is provided in the resources/whirr directory of the chapter resources.

    Tip

    whirr.aws-ec2-spot-price is an optional property that allows us to use cheaper EC2 Spot Instances. You can delete this property to use traditional EC2 on-demand instances.

    whirr.cluster-name=whirrhadoopcluster
    whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker 
    whirr.provider=aws-ec2
    whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
    whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
    whirr.hadoop.version=1.0.2
    whirr.aws-ec2-spot-price=0.08
  7. Execute the following command in the Whirr directory to launch your Hadoop cluster on EC2.
    >bin/whirr launch-cluster --config hadoop.properties
    
  8. Traffic from outside to the provisioned EC2 Hadoop cluster is routed through the master node. Whirr generates a script that we can use to start this proxy, under a subdirectory named after your Hadoop cluster inside the ~/.whirr directory. Run it in a new terminal. It will take a few minutes for Whirr to start the cluster and to generate this script.
    >cd ~/.whirr/whirrhadoopcluster/
    >sh hadoop-proxy.sh
    
  9. You can open the Hadoop web-based monitoring console on your local machine by configuring your web browser to use this proxy.
  10. Whirr generates a hadoop-site.xml for your cluster in the ~/.whirr/<your cluster name> directory. You can use it to issue Hadoop commands from your local machine to your Hadoop cluster on EC2. Export the path of the directory containing the generated hadoop-site.xml file to an environment variable named HADOOP_CONF_DIR. To execute the Hadoop commands, you should add the $HADOOP_HOME/bin directory to your path or you should issue the commands from the $HADOOP_HOME/bin directory.
    >export HADOOP_CONF_DIR=~/.whirr/whirrhadoopcluster/
    >hadoop fs -ls /
    
  11. Create a directory named wc-input-data in HDFS and upload a text data set to that directory.
    >hadoop fs -mkdir wc-input-data
    >hadoop fs -put sample.txt wc-input-data
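    If you do not have a text dataset at hand, a trivial way to create a small sample.txt locally before uploading it is shown below; the content is arbitrary and purely illustrative.

```shell
# Generate a tiny sample text file for the WordCount run; any plain
# text will do.
printf 'the quick brown fox jumps over the lazy dog\nthe end\n' > sample.txt
wc -w sample.txt
```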
    
  12. In this step, we run the Hadoop WordCount sample in the Hadoop cluster we started in Amazon EC2.
    >hadoop jar ~/workspace/HadoopBookChap10/c10-samples.jar chapter1.WordCount wc-input-data wc-out
    
  13. View the results of the WordCount computation by executing the following commands:
    >hadoop fs -ls wc-out
    Found 3 items
    -rw-r--r--   3 thilina supergroup          0 2012-09-05 15:40 /user/thilina/wc-out/_SUCCESS
    drwxrwxrwx   - thilina supergroup          0 2012-09-05 15:39 /user/thilina/wc-out/_logs
    -rw-r--r--   3 thilina supergroup      19908 2012-09-05 15:40 /user/thilina/wc-out/part-r-00000
    
    >hadoop fs -cat wc-out/part-* | more
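    Each line of the WordCount output is a word and its count separated by a tab. Once you copy a part file to your local machine (for example, with hadoop fs -get wc-out/part-r-00000 .), the following sketch lists the most frequent words; the topwords function name is our own.

```shell
# Sketch: sort a locally fetched WordCount part file (word<TAB>count
# per line) by descending count and show the ten most frequent words.
topwords() {
  sort -t$'\t' -k2,2nr "$1" | head -10
}
```

    For example: `topwords part-r-00000`.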
    
  14. Issue the following command to shut down the Hadoop cluster. Make sure to download any important data before shutting down the cluster, as it will be permanently lost.
    >bin/whirr destroy-cluster --config hadoop.properties
    

How it works...

This section describes the properties we used in the hadoop.properties file.

whirr.cluster-name=whirrhadoopcluster

The preceding property provides a name for the cluster. The instances of the cluster will be tagged using this name.

whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker

The preceding property specifies the number of instances to be used for each set of roles and the type of roles for the instances. In the preceding example, one EC2 small instance is used for the hadoop-jobtracker and hadoop-namenode roles, and two more EC2 small instances are used with both the hadoop-datanode and hadoop-tasktracker roles on each instance.
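For example, to provision a larger, hypothetical five-worker cluster instead, the template could be changed as follows (the instance counts here are purely illustrative):

```
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,5 hadoop-datanode+hadoop-tasktracker
```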

whirr.provider=aws-ec2

We use the Whirr Amazon EC2 provider to provision our cluster.

whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub

The preceding two properties point to the paths of the private key and the public key you provide for the cluster.

whirr.hadoop.version=1.0.2

We specify a custom Hadoop version using the preceding property. By default, Whirr 0.8 provisions a Hadoop 0.20.x cluster.

whirr.aws-ec2-spot-price=0.08

The preceding property specifies a bid price for the Amazon EC2 Spot Instances. Specifying this property triggers Whirr to use EC2 Spot Instances for the cluster. If the bid price is not met, the Spot Instance requests made by Whirr time out after 20 minutes. Refer to the Saving money by using Amazon EC2 Spot Instances to execute EMR job flows recipe for more details.

More details on Whirr configuration can be found at http://whirr.apache.org/docs/0.6.0/configuration-guide.html.

See also

  • The Using Apache Whirr to deploy an Apache HBase cluster in a cloud environment and Saving money by using Amazon EC2 Spot Instances to execute EMR job flows recipes of this chapter.