Setting up HDFS

HDFS is the distributed filesystem that ships with Hadoop. MapReduce tasks use HDFS to read and write data. An HDFS deployment consists of a single NameNode and multiple DataNodes.

For the HDFS setup, we need to configure the NameNode and the DataNodes, and then list the DataNodes in the slaves file. When we start the NameNode, the startup script starts the DataNodes.

Getting ready

You may follow this recipe using either a single machine or multiple machines. If you are using multiple machines, you should choose one machine as the master node, where you will run the HDFS NameNode. If you are using a single machine, use it as both the NameNode and the DataNode.

  1. Install Java on all machines that will be used to set up the HDFS cluster.
  2. If you are using Windows machines, install Cygwin and SSH server in each machine. The link http://pigtail.net/LRP/printsrv/cygwin-sshd.html provides step-by-step instructions.

How to do it...

Now let us set up HDFS in the distributed mode.

  1. Enable passwordless SSH from the master node to the slave nodes. Check that you can log in to localhost and to all the other nodes over SSH without a passphrase by running one of the following commands:
    • >ssh localhost
    • >ssh IPaddress
  2. If the above command returns an error or asks for a password, create SSH keys by executing the following command:
    >ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    

    Copy the ~/.ssh/id_dsa.pub file to all the nodes in the cluster. Then add the public key to the ~/.ssh/authorized_keys file on each node by running the following commands (if the authorized_keys file does not exist, run the touch command first; otherwise, skip straight to the cat command):

    >touch ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys
    

    Now with permissions set, add your key to the ~/.ssh/authorized_keys file.

    >cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
    

    Then you can log in with the following command:

    >ssh localhost
    

    These commands create an SSH key pair in the ~/.ssh/ directory of your home directory and register the generated public key with SSH as a trusted key.
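
    If your system provides the ssh-copy-id utility, distributing the key can be scripted. The following is a minimal sketch, assuming two example slave addresses and that you log in to the slaves as the same user; adjust the addresses (and add user@ prefixes if needed) for your cluster:

    >for node in 209.126.198.71 209.126.198.72; do ssh-copy-id -i ~/.ssh/id_dsa.pub $node; done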

  3. In each machine, create a directory for storing HDFS data. Let's call that directory HADOOP_DATA_DIR. Now create two subdirectories, HADOOP_DATA_DIR/data and HADOOP_DATA_DIR/name. Change the directory permissions to 755 by running the following command for each directory:
    >chmod 755 <name of dir>
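
    For example, assuming HADOOP_DATA_DIR is /opt/hadoop-data (an example path; use any location with enough disk space), the directories can be created and their permissions set as follows:

    >mkdir -p /opt/hadoop-data/data /opt/hadoop-data/name
    >chmod 755 /opt/hadoop-data /opt/hadoop-data/data /opt/hadoop-data/name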
    
  4. In the NameNode, change the directory to the unzipped HADOOP_HOME directory. Then place the IP addresses of all the slave nodes in the HADOOP_HOME/conf/slaves file, each on a separate line. When we start the NameNode, the startup script uses this slaves file to start the DataNodes.
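    For example, for a two-DataNode cluster using the addresses that appear later in this recipe's sample output (these IPs are only illustrative; use your own), the slaves file would contain just two lines:

    209.126.198.71
    209.126.198.72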
  5. In all machines, edit the HADOOP_HOME/conf/hadoop-env.sh file by uncommenting the JAVA_HOME line and pointing it to your local Java installation. For example, if Java is in /opt/jdk1.6, change the JAVA_HOME line to export JAVA_HOME=/opt/jdk1.6.
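    After the edit, the relevant part of the file would look like the following (the path is only an example; point it at your own JDK installation):

    HADOOP_HOME/conf/hadoop-env.sh

    export JAVA_HOME=/opt/jdk1.6   # example path; use your local Java installation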
  6. Inside each node's HADOOP_HOME/conf directory, add the following to the core-site.xml and hdfs-site.xml files. Before adding the configurations, replace the MASTER_NODE string with the IP address of the master node and HADOOP_DATA_DIR with the directory you created in step 3.
    HADOOP_HOME/conf/core-site.xml

    <configuration>
      <property>
        <name>fs.default.name</name>
        <!-- URL of MasterNode/NameNode -->
        <value>hdfs://MASTER_NODE:9000/</value>
      </property>
    </configuration>

    HADOOP_HOME/conf/hdfs-site.xml

    <configuration>
      <property>
        <name>dfs.name.dir</name>
        <!-- Path to store namespace and transaction logs -->
        <value>HADOOP_DATA_DIR/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <!-- Path to store data blocks in datanode -->
        <value>HADOOP_DATA_DIR/data</value>
      </property>
    </configuration>
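
    If you prefer not to edit the files by hand, the placeholder substitution can be scripted. This is a rough sketch assuming GNU sed, an example master IP address, and the example data directory from step 3; run it from the HADOOP_HOME directory on each node:

    >sed -i "s|MASTER_NODE|209.126.198.72|g" conf/core-site.xml
    >sed -i "s|HADOOP_DATA_DIR|/opt/hadoop-data|g" conf/hdfs-site.xml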
  7. From the NameNode, run the following command to format a new filesystem:
    >bin/hadoop namenode -format
    
    12/04/09 08:44:50 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    
    12/04/09 08:44:51 INFO common.Storage: Storage directory /Users/srinath/playground/hadoop-book/hadoop-temp/dfs/name has been successfully formatted.
    12/04/09 08:44:51 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at Srinath-s-MacBook-Pro.local/172.16.91.1
    ************************************************************/
    
  8. Start the HDFS cluster by running the following command from the HADOOP_HOME directory of the master node:
    >bin/start-dfs.sh
    

    This command will first start a NameNode. It will then look at the HADOOP_HOME/conf/slaves file and start the DataNodes. It will print a message like the following to the console.

    starting namenode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-namenode-node7.beta.out
    209.126.198.72: starting datanode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-datanode-node7.beta.out
    209.126.198.71: starting datanode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-datanode-node6.beta.out
    209.126.198.72: starting secondarynamenode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-secondarynamenode-node7.beta.out
    

    Hadoop uses a centralized architecture for metadata. In this design, the NameNode holds the information about all the files and about where the data blocks of each file are located. The NameNode is a single point of failure, and if it fails, all operations of the HDFS cluster stop. To reduce the impact of such a failure, Hadoop runs a secondary NameNode, which periodically merges the NameNode's namespace image with its edit log (checkpointing). The secondary NameNode is not an automatic failover replacement for the NameNode, but its checkpoint can be used to recover the filesystem metadata if the NameNode is lost.
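
    Once the daemons are up, you can also verify from the command line that the DataNodes have registered with the NameNode. The dfsadmin report lists the configured capacity and each live DataNode (the exact output depends on your cluster):

    >bin/hadoop dfsadmin -report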

  9. Access the link http://MASTER_NODE:50070/ and verify that you can see the HDFS startup page. Here, replace MASTER_NODE with the IP address of the master node running the HDFS NameNode.
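    If you are working on a machine without a browser, a quick probe with curl (assuming it is installed) is enough to confirm that the NameNode web interface is responding:

    >curl -I http://MASTER_NODE:50070/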
  10. Finally, shut down the HDFS cluster using the following command:
    >bin/stop-dfs.sh
    

How it works...

When started through the start-dfs.sh script, Hadoop reads the HADOOP_HOME/conf/slaves file, finds the DataNodes that need to be started, starts them, and sets up the HDFS cluster. In the HDFS basic command line file operations recipe, we will explore how to use HDFS to store and manage files.
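
Conceptually, the startup behaves like the following simplified loop. This is only an illustrative sketch, not the actual contents of start-dfs.sh, and /opt/hadoop-1.0.0 stands in for wherever Hadoop is unpacked on the slave nodes:

    # illustrative sketch only -- the real start-dfs.sh and slaves.sh scripts do more
    bin/hadoop-daemon.sh start namenode            # start the NameNode on this machine
    for slave in $(cat conf/slaves); do            # one DataNode address per line
      ssh "$slave" "cd /opt/hadoop-1.0.0 && bin/hadoop-daemon.sh start datanode"
    done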

HDFS setup is only a part of the Hadoop installation. The Setting Hadoop in a distributed cluster environment recipe describes how to set up the rest of Hadoop.
