Hadoop supports three different operating modes: standalone, pseudo-distributed, and fully-distributed.
This recipe will describe how to install and set up Hadoop to run in pseudo-distributed mode. In pseudo-distributed mode, all of the HDFS and MapReduce processes will start on a single node. Pseudo-distributed mode is an excellent environment to test your HDFS operations and/or your MapReduce applications on a subset of the data.
Ensure that you have Java 1.6, ssh, and sshd installed. In addition, the ssh daemon (sshd) should be running on the node. You can validate the installation of these applications by using the following commands:
$ java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)

$ ssh
usage: ssh [-1246AaCfgkMNnqsTtVvXxY] [-b bind_address] [-c cipher_spec]
           [-D [bind_address:]port] [-e escape_char] [-F configfile]
           [-i identity_file] [-L [bind_address:]port:host:hostport]
           [-l login_name] [-m mac_spec] [-O ctl_cmd] [-o option] [-p port]
           [-R [bind_address:]port:host:hostport] [-S ctl_path]
           [-w tunnel:tunnel] [user@]hostname [command]

$ service sshd status
openssh-daemon (pid 2004) is running...
Carry out the following steps to start Hadoop in pseudo-distributed mode:
1. Create a hadoop user account, and confirm that the JAVA_HOME environment property is set to the folder of the system's Java installation:

$ su - root
# useradd hadoop
# passwd hadoop
# su - hadoop
$ echo $JAVA_HOME
/usr/java/jdk1.6.0_31
2. Generate an ssh public and private key pair to allow a password-less login to the node using the hadoop user account. When asked for a passphrase, hit the Enter key, ensuring no passphrase will be used:

$ su - hadoop
$ ssh-keygen -t rsa
3. Copy the public key to localhost to authorize the hadoop account for password-less logins:

$ ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@localhost
4. Test the password-less ssh login. You should be able to ssh to localhost using your hadoop account without providing a password:

$ ssh localhost
5. Extract the Hadoop distribution using the hadoop user account:

# su - hadoop
$ tar -zxvf hadoop-0.20.x.tar.gz
6. Edit the following configuration files located in the conf folder of the extracted Hadoop distribution. These configuration changes will allow Hadoop to run in pseudo-distributed mode:

$ vi conf/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

$ vi conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

$ vi conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
7. Format the NameNode:

$ bin/hadoop namenode -format
8. Start all of the Hadoop processes:

$ bin/start-all.sh
9. Validate that the services started by opening the NameNode status page at http://localhost:50070/ and the JobTracker page at http://localhost:50030/. You can stop all of the Hadoop services by running the bin/stop-all.sh script.

Steps 1 through 4 set up a single node for a password-less login using ssh.
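In pseudo-distributed mode, start-all.sh launches five daemons: the NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker. One quick way to confirm they are all up is the JDK's jps tool; the sketch below simply greps its output for each expected name:

```shell
# Check each expected Hadoop daemon against the jps process listing.
# jps ships with the JDK; if it is absent, every daemon reports NOT running.
EXPECTED="NameNode DataNode SecondaryNameNode JobTracker TaskTracker"
RUNNING=$(jps 2>/dev/null || true)
JPS_REPORT=""
for d in $EXPECTED; do
  if printf '%s\n' "$RUNNING" | grep -qw "$d"; then
    JPS_REPORT="$JPS_REPORT$d: running
"
  else
    JPS_REPORT="$JPS_REPORT$d: NOT running
"
  fi
done
printf '%s' "$JPS_REPORT"
```

If any daemon is missing, its log file under the distribution's logs folder is the first place to look.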
Next, we downloaded a distribution of Hadoop and configured the distribution to run in pseudo-distributed mode. The fs.default.name property was set to a URI to tell Hadoop where to find the HDFS implementation, which is running on our local machine and listening on port 8020. Next, we set the replication factor of HDFS to 1 using the dfs.replication property. Since we are running all of the Hadoop services on a single node, there is no need to replicate any information; if we did, all of the replicated information would reside on the single node. We set the value of the last configuration property, mapred.job.tracker, to localhost:8021. The mapred.job.tracker property tells Hadoop where to find the JobTracker.
Finally, we formatted the NameNode and started the Hadoop services. You need to format the NameNode after you set up a new Hadoop cluster. Formatting a NameNode will erase all of the data in the cluster.
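With the services running, a short HDFS round trip makes a useful smoke test. This is a sketch with illustrative paths (/smoke and the file name are arbitrary); it assumes it is run from the extracted Hadoop folder and skips the HDFS calls when bin/hadoop is not present:

```shell
# Create a small local file, then (if bin/hadoop exists) copy it into HDFS
# and read it back to confirm the NameNode and DataNode are working.
echo "hello hdfs" > /tmp/hello.txt
if [ -x bin/hadoop ]; then
  bin/hadoop fs -mkdir /smoke
  bin/hadoop fs -put /tmp/hello.txt /smoke/hello.txt
  bin/hadoop fs -cat /smoke/hello.txt
else
  echo "bin/hadoop not found; run this from the extracted Hadoop folder"
fi
```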
By default, the Hadoop distribution comes configured to run in standalone mode. In standalone mode, there is no need to start any Hadoop service. In addition, input and output folders will be located on the local filesystem, instead of HDFS. To run a MapReduce job in standalone mode, use the configuration files that initially came with the distribution. Create an input folder on the local filesystem and use the Hadoop shell script:
$ mkdir input
$ cp somefiles*.txt input/
$ /path/to/hadoop/bin/hadoop jar myjar.jar input/*.txt output
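If you do not have a job jar of your own yet, the examples jar bundled with the distribution can serve as a stand-in for a standalone-mode test. The jar's exact file name varies by release (the 0.20.x line ships one named like hadoop-0.20.x-examples.jar), so the glob below is an assumption you may need to adjust:

```shell
# Run the bundled wordcount example in standalone mode against a tiny
# local input folder. The examples jar glob is release-dependent.
mkdir -p input
printf 'hello world\nhello hadoop\n' > input/sample.txt
if ls hadoop-*-examples.jar >/dev/null 2>&1; then
  bin/hadoop jar hadoop-*-examples.jar wordcount input output
  cat output/part-*   # counts are written to the local filesystem
else
  echo "examples jar not found; run this from the extracted Hadoop folder"
fi
```

Note that in standalone mode the output folder appears on the local filesystem, not in HDFS, and the job fails if output already exists.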