As mentioned in the previous recipe, Hadoop supports three different operating modes: local (standalone) mode, pseudo-distributed mode, and fully-distributed mode.
This recipe will describe how to set up Hadoop to run in fully-distributed mode. In fully-distributed mode, the HDFS and MapReduce services run across multiple machines. A typical architecture is to have one dedicated node run the NameNode and JobTracker services, another dedicated node host the Secondary NameNode service, and the remaining nodes in the cluster run both the DataNode and TaskTracker services.
This recipe will assume that steps 1 through 5 from the recipe Starting Hadoop in pseudo-distributed mode of this chapter have been completed. There should be a user named hadoop on every node in the cluster. In addition, the RSA public key generated in step 2 of the previous recipe must be distributed and installed on every node in the cluster using the ssh-copy-id command. Finally, the Hadoop distribution should be extracted and deployed on every node in the cluster.
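Distributing the key by hand gets tedious as the cluster grows, so it can be scripted. The following is a minimal sketch, assuming the hostnames used in this recipe and the default key location; it only prints each command so you can review it before removing the leading echo and running it for real:

```shell
# Print the ssh-copy-id command for each node in the cluster.
# The hostnames are the ones assumed by this recipe; adjust them to
# your environment. Remove the leading "echo" to actually install the keys.
for host in head secondary worker1 worker2 worker3; do
  echo ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@"$host"
done
```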
We will now discuss the specific configurations required to get the cluster running in distributed mode. We will assume that your cluster will use the following configuration:
| Server name | Purpose | Number of dedicated machines |
|---|---|---|
| head | Will run the NameNode and JobTracker services | 1 |
| secondary | Will run the Secondary NameNode service | 1 |
| worker(n) | Will run the TaskTracker and DataNode services | 3 or greater |
Perform the following steps to start Hadoop in fully-distributed mode:
Change the following Hadoop configuration files on every node in the cluster:

$ vi conf/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://head:8020</value>
  </property>
</configuration>

$ vi conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

$ vi conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>head:8021</value>
  </property>
</configuration>
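Because these files must be identical everywhere, one approach is to edit them once on the head node and push them out. The following dry-run sketch only prints the copy commands; it assumes the hostnames from the table above and that Hadoop is extracted at the placeholder path /path/to/hadoop on every node:

```shell
# Dry run: print the scp commands that would push the edited configuration
# files from the head node to every other node. The hostnames and the
# /path/to/hadoop prefix are assumptions; adjust them to your cluster,
# then remove the leading "echo" to perform the copy.
for host in secondary worker1 worker2 worker3; do
  echo scp conf/core-site.xml conf/hdfs-site.xml conf/mapred-site.xml \
    hadoop@"$host":/path/to/hadoop/conf/
done
```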
Update the masters and slaves configuration files on the head node. The masters configuration file contains the hostname of the node that will run the Secondary NameNode. The slaves configuration file contains a list of the hosts that will run the TaskTracker and DataNode services:

$ vi conf/masters
secondary

$ vi conf/slaves
worker1
worker2
worker3
Format the NameNode:

$ bin/hadoop namenode -format
As the hadoop user, start all of the Hadoop services:

$ bin/start-all.sh
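It is worth verifying that the daemons actually came up. One quick check is jps, which ships with the JDK (not with Hadoop) and lists running Java processes; the sketch below is for the head node, where you should see NameNode and JobTracker. On the workers, grep for DataNode and TaskTracker instead:

```shell
# Check that the expected daemons are running on the head node. The guard
# keeps the snippet harmless on a machine without a JDK, and "|| true"
# avoids a failure exit status when no daemon matches the pattern.
if command -v jps >/dev/null 2>&1; then
  jps | grep -E 'NameNode|JobTracker' || true
fi
```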
First, we changed the Hadoop configuration files core-site.xml, hdfs-site.xml, and mapred-site.xml on every node in the cluster.
These configuration files need to be updated to tell the Hadoop services running on every node where to find the NameNode and JobTracker services. In addition, we changed the HDFS replication factor to 3. Since we have three or more nodes available, we changed the replication from 1 to 3 in order to support high data availability in case one of the worker nodes experiences a failure.
It is not necessary to run the Secondary NameNode on a separate node. You can run the Secondary NameNode on the same node as the NameNode and JobTracker, if you wish. To do this, stop the cluster, modify the masters configuration file on the master node, and restart all of the services:

$ bin/stop-all.sh

$ vi conf/masters
head

$ bin/start-all.sh
Another set of configuration parameters that will come in handy when your cluster grows, or when you wish to perform maintenance, are the exclusion list parameters. By adding the dfs.hosts.exclude property to hdfs-site.xml and the mapred.hosts.exclude property to mapred-site.xml, you can list the nodes that will be barred from connecting to the NameNode and the JobTracker, respectively. These configuration parameters will be used later when we discuss decommissioning of a node in the cluster:
<property>
  <name>dfs.hosts.exclude</name>
  <value>/path/to/hadoop/dfs_excludes</value>
  <final>true</final>
</property>

<property>
  <name>mapred.hosts.exclude</name>
  <value>/path/to/hadoop/mapred_excludes</value>
  <final>true</final>
</property>
Create two empty files named dfs_excludes and mapred_excludes for future use:
$ touch /path/to/hadoop/dfs_excludes
$ touch /path/to/hadoop/mapred_excludes
Start the cluster:
$ bin/start-all.sh
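As a brief preview of how the exclude files get used (decommissioning is covered fully in a later recipe): you append the hostname of the node to retire to each exclude file, then ask the daemons to re-read the lists. The sketch below works against temporary copies of the files and uses a hypothetical hostname, worker3, so that nothing in a real cluster is touched:

```shell
# Sketch only: demonstrate the append step against temporary copies of the
# exclude files; "worker3" is a hypothetical hostname.
excludes=$(mktemp -d)
echo worker3 >> "$excludes/dfs_excludes"
echo worker3 >> "$excludes/mapred_excludes"
cat "$excludes/dfs_excludes"   # prints "worker3"

# Against the real files, you would then have the daemons re-read the lists:
#   $ bin/hadoop dfsadmin -refreshNodes   # NameNode re-reads dfs.hosts.exclude
#   $ bin/hadoop mradmin -refreshNodes    # JobTracker re-reads mapred.hosts.exclude
```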