Setting Hadoop in a distributed cluster environment

A Hadoop deployment includes an HDFS deployment, a single JobTracker, and multiple TaskTrackers. In the preceding recipe, Setting up HDFS, we discussed the HDFS deployment. For the Hadoop setup, we need to configure the JobTracker and TaskTrackers, and then list the TaskTracker nodes in the HADOOP_HOME/conf/slaves file. When we start the JobTracker, it will start the TaskTrackers on those nodes. The following diagram illustrates a Hadoop deployment:

[Diagram: Hadoop deployment in a distributed cluster environment]

Getting ready

You may follow this recipe using either a single machine or multiple machines. If you are using multiple machines, choose one machine as the master node, where you will run the HDFS NameNode and the JobTracker. If you are using a single machine, use it as both the master node and a slave node.

  1. Install Java on all machines that will be used to set up Hadoop.
  2. If you are using Windows machines, first install Cygwin and an SSH server on each machine. The page http://pigtail.net/LRP/printsrv/cygwin-sshd.html provides step-by-step instructions.
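
You can confirm that Java is installed correctly on each machine by checking its version (the examples in this recipe assume a JDK 1.6 installation, as used in the hadoop-env.sh step later):

    >java -version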

How to do it...

Let us set up Hadoop by configuring the JobTracker and TaskTrackers.

  1. On each machine, create a directory for Hadoop data. Let's call this directory HADOOP_DATA_DIR. Then create three subdirectories: HADOOP_DATA_DIR/data, HADOOP_DATA_DIR/local, and HADOOP_DATA_DIR/name.
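    For example, if HADOOP_DATA_DIR is /opt/hadoop-data (a placeholder path; use any location that suits your machines), the directories can be created as follows:
    >mkdir -p /opt/hadoop-data/data /opt/hadoop-data/local /opt/hadoop-data/name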
  2. Set up SSH keys on all machines so that we can log in to all of them from the master node. The Setting up HDFS recipe describes the SSH setup in detail.
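    The following is a minimal sketch of the key setup, assuming the same user account exists on every node; run the second command once for each node listed in the masters and slaves files:
    >ssh-keygen -t rsa
    >ssh-copy-id -i ~/.ssh/id_rsa.pub 209.126.198.71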
  3. Extract the Hadoop distribution to the same location on all machines using the >tar -zxvf hadoop-1.x.x.tar.gz command. You can use any of the Hadoop 1.0 branch distributions.
  4. On all machines, edit the HADOOP_HOME/conf/hadoop-env.sh file by uncommenting the JAVA_HOME line and pointing it to your local Java installation. For example, if Java is in /opt/jdk1.6, change the JAVA_HOME line to export JAVA_HOME=/opt/jdk1.6.
  5. Place the IP address of the node used as the master (the node that runs the JobTracker and NameNode) in HADOOP_HOME/conf/masters, on a single line. If you are doing a single-node deployment, leave the current value, localhost, as it is.
    209.126.198.72
  6. Place the IP addresses of all slave nodes in the HADOOP_HOME/conf/slaves file, each on a separate line.
    209.126.198.72
    209.126.198.71
  7. Inside each node's HADOOP_HOME/conf directory, add the following to the core-site.xml, hdfs-site.xml, and mapred-site.xml files. Before adding the configurations, replace MASTER_NODE with the IP address of the master node and HADOOP_DATA_DIR with the directory you created in the first step.

    Add the URL of the NameNode to HADOOP_HOME/conf/core-site.xml:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://MASTER_NODE:9000/</value>
      </property>
    </configuration>

    Add the locations used to store the filesystem metadata (names) and the data blocks to HADOOP_HOME/conf/hdfs-site.xml:

    <configuration>
      <property>
        <name>dfs.name.dir</name>
        <value>HADOOP_DATA_DIR/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>HADOOP_DATA_DIR/data</value>
      </property>
    </configuration>

    Add the JobTracker location to HADOOP_HOME/conf/mapred-site.xml; Hadoop will use it to submit jobs. The MapReduce local directory (mapred.local.dir) is the location Hadoop uses to store the temporary files it creates while running jobs. The final property sets the maximum number of map tasks per node; set it to the number of CPU cores on the node.

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>MASTER_NODE:9001</value>
      </property>
      <property>
        <name>mapred.local.dir</name>
        <value>HADOOP_DATA_DIR/local</value>
      </property>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>8</value>
      </property>
    </configuration>
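
    If you prefer not to edit each value by hand, the placeholders can be substituted in one pass with sed. The following is a minimal sketch assuming MASTER_NODE is 209.126.198.72 and HADOOP_DATA_DIR is /opt/hadoop-data:

    >cd HADOOP_HOME/conf
    >sed -i 's|MASTER_NODE|209.126.198.72|g; s|HADOOP_DATA_DIR|/opt/hadoop-data|g' core-site.xml hdfs-site.xml mapred-site.xml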
  8. To format a new HDFS filesystem, run the following command from the Hadoop NameNode (master node). If you have already done this as part of the HDFS installation in the earlier recipe, you can skip this step.
    >bin/hadoop namenode -format
    ...
    /Users/srinath/playground/hadoop-book/hadoop-temp/dfs/name has been successfully formatted.
    12/04/09 08:44:51 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at Srinath-s-MacBook-Pro.local/172.16.91.1
    ************************************************************/
    
  9. On the master node, change the directory to HADOOP_HOME and run the following commands:
    >bin/start-dfs.sh
    starting namenode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-namenode-node7.beta.out
    209.126.198.72: starting datanode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-datanode-node7.beta.out
    209.126.198.71: starting datanode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-datanode-node6.beta.out
    209.126.198.72: starting secondarynamenode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-secondarynamenode-node7.beta.out
    >bin/start-mapred.sh
    starting jobtracker, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-jobtracker-node7.beta.out
    209.126.198.72: starting tasktracker, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-tasktracker-node7.beta.out
    209.126.198.71: starting tasktracker, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-tasktracker-node6.beta.out
    
  10. Verify the installation by listing the processes through the ps -ef | grep java command (if you are using Linux) or via Task Manager (if you are on Windows), on the master node and the slave nodes. The master node will list the NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker processes, while each slave will have a DataNode and a TaskTracker.
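    If the JDK's jps utility is on your path, it gives a quicker check; running it on each node should list the daemon names mentioned above:
    >jps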
  11. Browse the web-based monitoring pages for the NameNode and the JobTracker:
    • NameNode: http://MASTER_NODE:50070/.
    • JobTracker: http://MASTER_NODE:50030/.
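    If no browser is available, you can also confirm that these pages respond by using curl from any machine that can reach the master node; the IP below is the example master address used in the earlier steps:
    >curl -s -o /dev/null -w "%{http_code}\n" http://209.126.198.72:50070/
    >curl -s -o /dev/null -w "%{http_code}\n" http://209.126.198.72:50030/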
  12. You can find the logfiles under ${HADOOP_HOME}/logs.
  13. Make sure the HDFS setup is working by listing the files using the HDFS command line:
    >bin/hadoop dfs -ls /
    
    Found 2 items
    drwxr-xr-x   - srinath supergroup    0 2012-04-09 08:47 /Users
    drwxr-xr-x   - srinath supergroup    0 2012-04-09 08:47 /tmp
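
    Optionally, you can verify the setup further by copying a small file into HDFS and reading it back; the /test path used here is just an example:
    >bin/hadoop dfs -mkdir /test
    >bin/hadoop dfs -put conf/core-site.xml /test/
    >bin/hadoop dfs -cat /test/core-site.xml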
    

How it works...

As described in the introduction to the chapter, a Hadoop installation consists of HDFS nodes, a JobTracker, and worker nodes. When we start the NameNode, it finds the slaves through the HADOOP_HOME/conf/slaves file and uses SSH to start the DataNodes on the remote servers at startup. Similarly, when we start the JobTracker, it finds the slaves through the HADOOP_HOME/conf/slaves file and starts the TaskTrackers.
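
Conceptually, each start script loops over the slaves file and starts the appropriate daemon on every listed host over SSH, roughly like the simplified sketch below (assuming the HADOOP_HOME environment variable is set; the real logic lives in bin/slaves.sh and bin/hadoop-daemon.sh):

    for slave in $(cat "$HADOOP_HOME/conf/slaves"); do
      ssh "$slave" "cd $HADOOP_HOME; bin/hadoop-daemon.sh start datanode"
    done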

There's more...

In the next recipe, we will discuss how to run the aforementioned WordCount program using the distributed setup. The following recipes will discuss how to use the MapReduce monitoring UI to monitor the distributed Hadoop setup.
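
When you are done, the matching stop scripts shut the cluster down; run them from HADOOP_HOME on the master node:

    >bin/stop-mapred.sh
    >bin/stop-dfs.sh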
