Hadoop supports adding new nodes to an existing cluster without shutting down or restarting any service. This recipe will outline the steps required to add a new node to a pre-existing cluster.
Ensure that you have a Hadoop cluster up and running. In addition, ensure that you have the Hadoop distribution extracted, and the configuration files have been updated with the settings from the recipe titled Starting Hadoop in distributed mode.
We will use the following terms for our imaginary cluster:
Server name | Purpose | Number of dedicated machines |
---|---|---|
head | Will run the NameNode and JobTracker services | 1 |
secondary | Will run the Secondary NameNode service | 1 |
worker(n) | Will run the TaskTracker and DataNode services | 3 or greater |
Follow these steps to add new nodes to an existing cluster:
1. From the head node, update the slaves configuration file with the hostname of the new node:

    $ vi conf/slaves
    worker1
    worker2
    worker3
    worker4

2. Log in to the new node and start the DataNode and TaskTracker services:

    $ ssh hadoop@worker4
    $ cd /path/to/hadoop
    $ bin/hadoop-daemon.sh start datanode
    $ bin/hadoop-daemon.sh start tasktracker
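The steps above can also be scripted. The following is a minimal sketch, not part of the original recipe: the function names add_slave and start_worker are hypothetical, and HADOOP_HOME is a placeholder for /path/to/hadoop. It assumes the slaves file ends with a newline and that the hadoop user can reach the new node over SSH.

```shell
#!/bin/sh
# Append a hostname to a slaves file only if it is not already listed.
# Usage: add_slave <slaves-file> <hostname>
add_slave() {
    slaves_file=$1
    new_host=$2
    # -x matches the whole line, -F treats the hostname as a literal string
    if grep -qxF "$new_host" "$slaves_file" 2>/dev/null; then
        echo "$new_host is already listed in $slaves_file"
    else
        echo "$new_host" >> "$slaves_file"
        echo "added $new_host to $slaves_file"
    fi
}

# Start the DataNode and TaskTracker daemons on the new node over SSH.
# HADOOP_HOME is an assumed variable holding the Hadoop directory on that node.
# Usage: start_worker <hostname>
start_worker() {
    new_host=$1
    ssh "hadoop@$new_host" \
        "cd '$HADOOP_HOME' && bin/hadoop-daemon.sh start datanode && bin/hadoop-daemon.sh start tasktracker"
}
```

On the head node, this might be used as add_slave conf/slaves worker4 followed by HADOOP_HOME=/path/to/hadoop start_worker worker4. Checking for an existing entry first keeps the slaves file free of duplicates if the script is run twice.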
We updated the slaves configuration file on the head node to tell the Hadoop framework that a new node exists in the cluster. However, this file is only read when the Hadoop services are started (for example, by executing the bin/start-all.sh script). In order to add the new node to the cluster without having to restart all of the Hadoop services, we logged into the new node, and started the DataNode and TaskTracker services manually.
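Because the services were started by hand, it can be worth confirming that both daemons are actually running on the new node. The sketch below is one possible check, assuming the JDK's jps tool is available to the hadoop user; the function name daemons_running is hypothetical.

```shell
#!/bin/sh
# Read `jps` output on stdin (lines of the form "<pid> <MainClass>") and
# succeed only if both a DataNode and a TaskTracker process are listed.
daemons_running() {
    awk '
        $2 == "DataNode"    { dn = 1 }
        $2 == "TaskTracker" { tt = 1 }
        END { exit !(dn && tt) }
    '
}
```

For example, ssh hadoop@worker4 jps | daemons_running && echo "worker4 is running both daemons" prints the message only when both services are found. The exact match on the class name means a NameNode or SecondaryNameNode process is not mistaken for a DataNode.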
When you add a new node to the cluster, the cluster is not properly balanced. HDFS will not automatically redistribute any existing data to the new node in order to balance the cluster. To rebalance the existing data in the cluster, you can run the following command from the head node:
$ bin/start-balancer.sh
Rebalancing a Hadoop cluster is a network-intensive task. Depending on the number of nodes added to the cluster, terabytes of data might be moved between machines. Jobs may run more slowly while the cluster is rebalancing, so rebalancing should be planned as part of regular maintenance.
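The balancer script accepts a -threshold option, a percentage that controls how far each DataNode's disk usage may differ from the cluster average before the balancer considers the cluster balanced. A smaller value gives a more even cluster but moves more data. The wrapper below is a minimal sketch under stated assumptions: the function name rebalance is hypothetical, HADOOP_HOME is a placeholder for the Hadoop directory on the head node, and only whole-number thresholds are accepted as a simplification.

```shell
#!/bin/sh
# Start the HDFS balancer with an explicit threshold (whole number, 0-100).
# HADOOP_HOME is an assumed variable pointing at the Hadoop directory.
# Usage: rebalance <threshold>
rebalance() {
    threshold=$1
    # Reject empty or non-numeric input before calling the balancer
    case $threshold in
        ''|*[!0-9]*)
            echo "usage: rebalance <threshold 0-100>" >&2
            return 1 ;;
    esac
    if [ "$threshold" -gt 100 ]; then
        echo "threshold must not exceed 100" >&2
        return 1
    fi
    "$HADOOP_HOME/bin/start-balancer.sh" -threshold "$threshold"
}
```

Calling rebalance 5 from the head node asks the balancer to bring every DataNode within 5 percentage points of the average utilization. If jobs slow down too much, a running balancer can be stopped with bin/stop-balancer.sh and restarted during a quieter period.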