Adding a new DataNode

This recipe shows how to add new nodes to an existing HDFS cluster without restarting the whole cluster, and how to force HDFS to rebalance after the addition of new nodes.

Getting ready

To get started, follow these steps:

  1. Install Hadoop on the new node and replicate the configuration files of your existing Hadoop cluster. You can use rsync to copy the Hadoop configuration from another node. For example:
    >rsync -a <master_node_ip>:hadoop-1.0.x/conf/ $HADOOP_HOME/conf/
    
  2. Ensure that the master node of your Hadoop/HDFS cluster can perform password-less SSH to the new node. Password-less SSH setup is optional if you are not planning to use the bin/*.sh scripts from the master node to start/stop the cluster.
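    If you do plan to use those scripts, the following is a minimal sketch for enabling password-less SSH from the master node to the new node. It assumes you run it as the Hadoop user on the master node and that <new_node_ip> is a placeholder for the new node's address:
    >ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
    >ssh-copy-id <new_node_ip>
    
    If ssh-copy-id is not available on your system, append the contents of ~/.ssh/id_rsa.pub on the master node to the ~/.ssh/authorized_keys file on the new node instead.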

How to do it...

The following steps will show you how to add a new DataNode to an existing HDFS cluster:

  1. Add the IP address or the DNS name of the new node to the $HADOOP_HOME/conf/slaves file in the master node.
  2. Start the DataNode in the newly added slave node by using the following command:
    >bin/hadoop-daemon.sh start datanode
    

    Tip

    You can also use the $HADOOP_HOME/bin/start-dfs.sh script from the master node to start the DataNode daemons in the newly added nodes. This is helpful if you are adding more than one new DataNode to the cluster.

  3. Check the $HADOOP_HOME/logs/hadoop-*-datanode-*.log file in the new slave node for any errors. You can also verify that the new node has joined the cluster, as shown in the sketch after these steps.

The preceding steps apply both to adding a new node and to re-joining a node that has crashed and been restarted.
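To confirm that the new DataNode has registered with the NameNode, you can run the dfsadmin report from the master node and check that the new node appears in the list of live DataNodes:

    >bin/hadoop dfsadmin -report
    
The report shows the configured and used capacity of each live DataNode in the cluster.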

There's more...

You can add a new node to the Hadoop MapReduce cluster in a similar manner:

  1. Start the TaskTracker in the new node using the following command:
    >bin/hadoop-daemon.sh start tasktracker
    
  2. Check the $HADOOP_HOME/logs/hadoop-*-tasktracker-*.log file in the new slave node for any errors. You can also confirm that the TaskTracker has joined the cluster, as shown next.
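To verify that the new TaskTracker has registered with the JobTracker, you can list the active trackers from the master node; the newly added node should appear in the output:

    >bin/hadoop job -list-active-trackers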

Rebalancing HDFS

When you add new nodes, HDFS does not rebalance automatically. However, HDFS provides a rebalancer tool that can be invoked manually. This tool balances the data blocks across the cluster up to an optional threshold percentage. Rebalancing is especially helpful if you are running out of space on the existing nodes.

  1. Execute the following command. The optional -threshold parameter specifies the percentage of disk capacity leeway to consider when identifying a node as under- or over-utilized. An under-utilized DataNode is a node whose utilization is less than (average utilization - threshold). An over-utilized DataNode is a node whose utilization is greater than (average utilization + threshold). Smaller threshold values achieve more evenly balanced nodes, but the rebalancing takes more time. The default threshold value is 10 percent.
    >bin/start-balancer.sh -threshold 15
    
  2. Rebalancing can be stopped by executing the bin/stop-balancer.sh command.
  3. A summary of the rebalancing will be available in the $HADOOP_HOME/logs/hadoop-*-balancer*.out file.
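The balancer throttles the bandwidth each DataNode may use for moving blocks, so rebalancing a large cluster can be slow with the default limit. In Hadoop 1.x this limit is set by the dfs.balance.bandwidthPerSec property (in bytes per second) in the conf/hdfs-site.xml file, which the DataNodes read at startup. The following snippet raises the limit to 10 MB/s; the value is illustrative, not a recommendation:

    <property>
      <name>dfs.balance.bandwidthPerSec</name>
      <value>10485760</value>
    </property>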

See also

  • The Decommissioning data nodes recipe in this chapter.