A Hadoop deployment consists of an HDFS deployment, a single JobTracker, and multiple TaskTrackers. In the preceding recipe, Setting up HDFS, we discussed the HDFS deployment. For the Hadoop setup, we need to configure the JobTracker and TaskTrackers and then specify the TaskTracker nodes in the HADOOP_HOME/conf/slaves file. When we start the JobTracker, it will start the TaskTracker nodes. The following diagram illustrates a Hadoop deployment:

You may follow this recipe either using a single machine or multiple machines. If you are using multiple machines, you should choose one machine as the master node, where you will run the HDFS NameNode and the JobTracker. If you are using a single machine, use it as both the master node and a slave node.
Let us set up Hadoop by setting up the JobTracker and TaskTrackers.

1. In each machine, create a directory to store Hadoop data, which we will call HADOOP_DATA_DIR. Then create three directories: HADOOP_DATA_DIR/data, HADOOP_DATA_DIR/local, and HADOOP_DATA_DIR/name.
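As a sketch, the layout from this step can be created with mkdir -p; the /tmp/hadoop-data-example path below is only an illustrative stand-in for your actual HADOOP_DATA_DIR:

```shell
# Illustrative HADOOP_DATA_DIR; substitute the real path on your machines.
export HADOOP_DATA_DIR=/tmp/hadoop-data-example

# data  - DataNode block storage
# local - MapReduce temporary (local) files
# name  - NameNode metadata
mkdir -p "$HADOOP_DATA_DIR/data" "$HADOOP_DATA_DIR/local" "$HADOOP_DATA_DIR/name"
```

Run the same commands on every machine in the cluster so that the paths match the configuration files edited later.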
2. Unzip the Hadoop distribution at the same location in all machines using the >tar -zxvf hadoop-1.x.x.tar.gz command. You can use any of the Hadoop 1.0 branch distributions.

3. In all machines, edit the HADOOP_HOME/conf/hadoop-env.sh file by uncommenting the JAVA_HOME line and pointing it to your local Java installation. For example, if Java is in /opt/jdk1.6, change the JAVA_HOME line to export JAVA_HOME=/opt/jdk1.6.
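The JAVA_HOME edit can also be scripted. The following sketch assumes GNU sed and works on a stand-in copy of hadoop-env.sh (the commented line resembles the stock file's, and /opt/jdk1.6 is just the example path from above):

```shell
# Stand-in for HADOOP_HOME/conf/hadoop-env.sh (hypothetical path).
CONF_DIR=/tmp/hadoop-env-example
mkdir -p "$CONF_DIR"
echo '# export JAVA_HOME=/usr/lib/j2sdk1.5-sun' > "$CONF_DIR/hadoop-env.sh"

# Uncomment the JAVA_HOME line and point it at the local JDK.
sed -i 's|^# *export JAVA_HOME=.*|export JAVA_HOME=/opt/jdk1.6|' "$CONF_DIR/hadoop-env.sh"
```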
4. Place the IP address of the master node in HADOOP_HOME/conf/masters in a single line. If you are doing a single-node deployment, leave the current value, localhost, as it is.

209.126.198.72
5. Place the IP addresses of all slave nodes in the HADOOP_HOME/conf/slaves file, each in a separate line.

209.126.198.72
209.126.198.71
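The masters and slaves files from the last two steps can be written like this; the conf path is a stand-in, and the IPs are the example addresses used above:

```shell
# Stand-in for HADOOP_HOME/conf (hypothetical path).
CONF_DIR=/tmp/hadoop-conf-example
mkdir -p "$CONF_DIR"

# masters: the single master node, in one line.
echo '209.126.198.72' > "$CONF_DIR/masters"

# slaves: every slave node, one IP per line (the master may double as a slave).
printf '209.126.198.72\n209.126.198.71\n' > "$CONF_DIR/slaves"
```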
6. Inside each node's HADOOP_HOME/conf directory, add the following configurations to core-site.xml, hdfs-site.xml, and mapred-site.xml. Before adding the configurations, replace MASTER_NODE with the IP address of the master node and HADOOP_DATA_DIR with the directory you created in the first step.

Add the URL of the NameNode to HADOOP_HOME/conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://MASTER_NODE:9000/</value>
  </property>
</configuration>
Add the locations to store metadata (names) and data to HADOOP_HOME/conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>HADOOP_DATA_DIR/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>HADOOP_DATA_DIR/data</value>
  </property>
</configuration>
The MapReduce local directory is the location Hadoop uses to store the temporary files it needs. Add the JobTracker location to HADOOP_HOME/conf/mapred-site.xml; Hadoop will use it for submitting jobs. The final property sets the maximum number of map tasks per node; set it to the same value as the number of CPU cores.
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>MASTER_NODE:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>HADOOP_DATA_DIR/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
</configuration>
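One way to fill in the MASTER_NODE and HADOOP_DATA_DIR placeholders across all three files is a single sed pass. This is a sketch assuming GNU sed; the IP, data directory, and conf path are example values, and the heredoc reproduces the core-site.xml template from above:

```shell
# Example values; substitute your own master IP and data directory.
MASTER_IP=209.126.198.72
DATA_DIR=/opt/hadoop-data

# Stand-in conf directory holding the core-site.xml template from above.
CONF_DIR=/tmp/hadoop-conf-template-example
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://MASTER_NODE:9000/</value>
  </property>
</configuration>
EOF

# Replace both placeholders in every *.xml under the conf directory.
sed -i "s|MASTER_NODE|$MASTER_IP|g; s|HADOOP_DATA_DIR|$DATA_DIR|g" "$CONF_DIR"/*.xml
```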
7. To format a new HDFS filesystem, run the following command from the master node:

>bin/hadoop namenode -format
...
/Users/srinath/playground/hadoop-book/hadoop-temp/dfs/name has been successfully formatted.
12/04/09 08:44:51 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Srinath-s-MacBook-Pro.local/172.16.91.1
************************************************************/
8. In the master node, change the directory to HADOOP_HOME and run the following commands:

>bin/start-dfs.sh
starting namenode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-namenode-node7.beta.out
209.126.198.72: starting datanode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-datanode-node7.beta.out
209.126.198.71: starting datanode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-datanode-node6.beta.out
209.126.198.72: starting secondarynamenode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-secondarynamenode-node7.beta.out

>bin/start-mapred.sh
starting jobtracker, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-jobtracker-node7.beta.out
209.126.198.72: starting tasktracker, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-tasktracker-node7.beta.out
209.126.198.71: starting tasktracker, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-tasktracker-node6.beta.out
9. Verify the installation by listing the processes through the ps | grep java command (if you are using Linux) or via the Task Manager (if you are using Windows), in the master node and slave nodes. The master node will list four processes: NameNode, DataNode, JobTracker, and TaskTracker, and the slaves will each have a DataNode and a TaskTracker.

10. Browse the web-based monitoring pages for the NameNode at http://MASTER_NODE:50070/ and for the JobTracker at http://MASTER_NODE:50030/.

11. You can find the logfiles under ${HADOOP_HOME}/logs.

12. Make sure that HDFS is set up by listing the files with the following command:

bin/hadoop dfs -ls /
Found 2 items
drwxr-xr-x - srinath supergroup 0 2012-04-09 08:47 /Users
drwxr-xr-x - srinath supergroup 0 2012-04-09 08:47 /tmp
As described in the introduction to the chapter, a Hadoop installation consists of HDFS nodes, a JobTracker, and worker nodes. When we start the NameNode, it finds the slaves through the HADOOP_HOME/conf/slaves file and uses SSH to start the DataNodes on the remote servers at startup. Likewise, when we start the JobTracker, it finds the slaves through the HADOOP_HOME/conf/slaves file and starts the TaskTrackers.