This recipe shows how to add new nodes to an existing HDFS cluster without restarting the whole cluster, and how to force HDFS to rebalance after the addition of new nodes.
To get started, follow these steps:

1. Install Hadoop on the new node and replicate the configuration of the existing nodes. You can use rsync to copy the Hadoop configuration from another node. For example:
>rsync -a <master_node_ip>:hadoop-1.0.x/conf $HADOOP_HOME/conf
2. Ensure that the master node can perform password-less SSH to the new node. This step is optional if you are not planning to use the bin/*.sh scripts from the master node to start/stop the cluster.

The following steps will show you how to add a new DataNode to an existing HDFS cluster:
1. Add the IP address or the DNS name of the new node to the $HADOOP_HOME/conf/slaves file in the master node.
2. Start the DataNode in the new slave node using the following command:
>bin/hadoop-daemon.sh start datanode
3. Check the $HADOOP_HOME/logs/hadoop-*-datanode-*.log file in the new slave node for any errors.

The preceding steps apply both to adding a new node and to re-joining a node that has crashed and been restarted.
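The registration step above can be sketched as a small shell snippet that appends the new node to the slaves file only when it is not already listed, so re-running it is safe. The relative path and the node's IP address below are hypothetical stand-ins for $HADOOP_HOME/conf/slaves and your new slave node:

```shell
# Hypothetical stand-ins for $HADOOP_HOME/conf/slaves and the new node's address
SLAVES=conf/slaves
NEW_NODE=10.0.0.5

# Create the file if needed, then append the node only if it is not listed yet
mkdir -p conf && touch "$SLAVES"
grep -qx "$NEW_NODE" "$SLAVES" || echo "$NEW_NODE" >> "$SLAVES"
cat "$SLAVES"
```

Guarding the append with grep keeps the slaves file free of duplicate entries if the step is repeated, for example when re-joining a restarted node.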
Similarly, you can add a new node to the Hadoop MapReduce cluster as well:

1. Start the TaskTracker in the new node using the following command:
>bin/hadoop-daemon.sh start tasktracker
2. Check the $HADOOP_HOME/logs/hadoop-*-tasktracker-*.log file in the new slave node for any errors.

When you add new nodes, HDFS does not rebalance automatically. However, HDFS provides a rebalancer tool that can be invoked manually. This tool balances the data blocks across the cluster up to an optional threshold percentage. Rebalancing is particularly helpful when the existing nodes are running low on disk space.
The -threshold parameter specifies the percentage of disk capacity leeway to consider when identifying a node as under- or over-utilized. A DataNode is under-utilized if its utilization is less than (average utilization - threshold), and over-utilized if its utilization is greater than (average utilization + threshold). Smaller threshold values achieve more evenly balanced nodes, but take more time for the rebalancing. The default threshold value is 10 percent. The following command runs the rebalancer with a threshold of 15 percent:
>bin/start-balancer.sh -threshold 15
Rebalancing can be stopped using the bin/stop-balancer.sh command. A summary of the rebalancing is written to the $HADOOP_HOME/logs/hadoop-*-balancer*.out file.
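As a worked example of the threshold rule, with hypothetical figures: if the average cluster utilization is 60 percent and the default threshold of 10 percent is used, a node counts as under-utilized below 50 percent and over-utilized above 70 percent:

```shell
# Hypothetical figures: average cluster utilization 60%, default threshold 10%
AVG=60
THRESHOLD=10

# Under-utilized below (AVG - THRESHOLD); over-utilized above (AVG + THRESHOLD)
UNDER=$((AVG - THRESHOLD))
OVER=$((AVG + THRESHOLD))
echo "under-utilized below ${UNDER}%, over-utilized above ${OVER}%"
```

Raising the threshold to 15, as in the command above, widens this band to 45-75 percent, so fewer blocks need to move and the rebalancing finishes sooner.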