HDFS is the distributed filesystem that ships with Hadoop. MapReduce tasks use HDFS to read and write data. An HDFS deployment consists of a single NameNode and multiple DataNodes.
For the HDFS setup, we need to configure the NameNode and the DataNodes, and then specify the DataNodes in the slaves
file. When we start the NameNode, the startup script will start the DataNodes.
You may follow this recipe using either a single machine or multiple machines. If you are using multiple machines, you should choose one machine as the master node, where you will run the HDFS NameNode. If you are using a single machine, use it as both the NameNode and the DataNode.
Now let us set up HDFS in the distributed mode.
First, check that you can log in to the local machine and to the other nodes in the cluster over SSH without a password:
>ssh localhost
>ssh IPaddress
If these commands prompt for a password, generate an SSH key pair by running the following command:
>ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Move the ~/.ssh/id_dsa.pub
file to all the nodes in the cluster. Then add the SSH keys to the ~/.ssh/authorized_keys
file in each node by running the following command (if the authorized_keys
file does not exist, run the following command; otherwise, skip to the cat
command):
>touch ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys
Now with permissions set, add your key to the ~/.ssh/authorized_keys
file.
>cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Then you can log in with the following command:
>ssh localhost
These commands create an SSH key pair in the .ssh/
directory of the home directory and register the generated public key with SSH as a trusted key.
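For example, assuming a user named hadoop and a DataNode at the hypothetical address 192.168.1.101, the public key could be copied to that node and appended to its authorized_keys file as follows (repeat for each node in the cluster):
>scp ~/.ssh/id_dsa.pub hadoop@192.168.1.101:~/
>ssh hadoop@192.168.1.101 'cat ~/id_dsa.pub >> ~/.ssh/authorized_keys'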
In each machine, create a directory for storing Hadoop data, which we will refer to as HADOOP_DATA_DIR
. Now let us create two subdirectories, HADOOP_DATA_DIR/data
and HADOOP_DATA_DIR/name
. Change the directory permissions to 755
by running the following command for each directory:
>chmod 755 <name of dir>
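For example, assuming /opt/hadoop-data is used as the hypothetical HADOOP_DATA_DIR, the directories could be created and their permissions set as follows:
>mkdir -p /opt/hadoop-data/data /opt/hadoop-data/name
>chmod 755 /opt/hadoop-data /opt/hadoop-data/data /opt/hadoop-data/name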
In the master node, download and extract the Hadoop distribution to a directory, which we will refer to as the HADOOP_HOME
directory. Then place the IP address of each slave node in the HADOOP_HOME/conf/slaves
file, each on a separate line. When we start HDFS, the startup script will use the slaves
file to start the DataNodes.
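For instance, a cluster with DataNodes at the hypothetical addresses 192.168.1.101 and 192.168.1.102 would have a HADOOP_HOME/conf/slaves file containing just these two lines:
192.168.1.101
192.168.1.102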
Next, set up the Java installation that Hadoop will use by editing the HADOOP_HOME/conf/hadoop-env.sh
file by uncommenting the JAVA_HOME
line and pointing it to your local Java installation. For example, if Java is in /opt/jdk1.6
, change the JAVA_HOME
line to export JAVA_HOME=/opt/jdk1.6.
Inside each node's HADOOP_HOME/conf
directory, add the following code to the core-site.xml
and hdfs-site.xml
files. Before adding the configurations, replace the MASTER_NODE
strings with the IP address of the master node and HADOOP_DATA_DIR
with the directory you created earlier.
HADOOP_HOME/conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- URL of MasterNode/NameNode -->
    <value>hdfs://MASTER_NODE:9000/</value>
  </property>
</configuration>
HADOOP_HOME/conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <!-- Path to store namespace and transaction logs -->
    <value>HADOOP_DATA_DIR/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <!-- Path to store data blocks in datanode -->
    <value>HADOOP_DATA_DIR/data</value>
  </property>
</configuration>
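For example, with a master node at the hypothetical address 192.168.1.100, the fs.default.name property in core-site.xml would become:
  <property>
    <name>fs.default.name</name>
    <!-- hypothetical master node address used for illustration -->
    <value>hdfs://192.168.1.100:9000/</value>
  </property>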
From the master node, run the following command to format a new HDFS filesystem. The output will look like the following:
>bin/hadoop namenode -format
12/04/09 08:44:50 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
…
12/04/09 08:44:51 INFO common.Storage: Storage directory /Users/srinath/playground/hadoop-book/hadoop-temp/dfs/name has been successfully formatted.
12/04/09 08:44:51 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Srinath-s-MacBook-Pro.local/172.16.91.1
************************************************************/
Start HDFS by running the following command from the master node:
>bin/start-dfs.sh
This command will first start a NameNode. It will then look at the HADOOP_HOME/conf/slaves
file and start the DataNodes. It will print a message like the following to the console.
starting namenode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-namenode-node7.beta.out
209.126.198.72: starting datanode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-datanode-node7.beta.out
209.126.198.71: starting datanode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-datanode-node6.beta.out
209.126.198.72: starting secondarynamenode, logging to /root/hadoop-setup-srinath/hadoop-1.0.0/libexec/../logs/hadoop-root-secondarynamenode-node7.beta.out
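To quickly check that the daemons have started (assuming the JDK's jps tool is on your path), you can list the running Java processes on each node; the master node should show NameNode, and each slave node should show DataNode:
>jps
You can also list the root of the new filesystem from the master node:
>bin/hadoop fs -ls /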
Hadoop uses a centralized architecture for metadata. In this design, the NameNode holds the information about all the files and about where the data blocks of each file are stored. The NameNode is a single point of failure, and if it fails, all operations of the HDFS cluster stop. To reduce the impact of such a failure, Hadoop runs a secondary NameNode, which periodically merges the NameNode's namespace image with its edit log to produce a checkpoint. This checkpoint can be used to recover a failed NameNode, although the secondary NameNode does not automatically take over its role.
To verify the installation, point your browser to http://MASTER_NODE:50070/
and check that you can see the HDFS startup page. Here, replace MASTER_NODE
with the IP address of the master node running the HDFS NameNode.
Finally, shut down the HDFS cluster with the following command:
>bin/stop-dfs.sh
When we run the start-dfs.sh script, it starts the NameNode, reads the HADOOP_HOME/conf/slaves
file to find the DataNodes that need to be started, starts them, and sets up the HDFS cluster. In the HDFS basic command line file operations recipe, we will explore how to use HDFS to store and manage files.
HDFS setup is only a part of the Hadoop installation. The Setting Hadoop in a distributed cluster environment recipe describes how to set up the rest of Hadoop.