This recipe describes how to run a job in a distributed cluster.
Now let us run the WordCount sample in the distributed Hadoop setup.
1. Copy the README.txt file in your Hadoop distribution to the HDFS filesystem at the location /data/input1:

>bin/hadoop dfs -mkdir /data/
>bin/hadoop dfs -mkdir /data/input1
>bin/hadoop dfs -put README.txt /data/input1/README.txt
>bin/hadoop dfs -ls /data/input1

Found 1 items
-rw-r--r--   1 srinath supergroup       1366 2012-04-09 08:59 /data/input1/README.txt
2. Run the WordCount sample by executing the following command from the HADOOP_HOME directory:

>bin/hadoop jar hadoop-examples-1.0.0.jar wordcount /data/input1 /data/output1

12/04/09 09:04:25 INFO input.FileInputFormat: Total input paths to process : 1
12/04/09 09:04:26 INFO mapred.JobClient: Running job: job_201204090847_0001
12/04/09 09:04:27 INFO mapred.JobClient:  map 0% reduce 0%
12/04/09 09:04:42 INFO mapred.JobClient:  map 100% reduce 0%
12/04/09 09:04:54 INFO mapred.JobClient:  map 100% reduce 100%
12/04/09 09:04:59 INFO mapred.JobClient: Job complete: job_201204090847_0001
.....
3. Run the following commands to list the output directory and to view the results:

>bin/hadoop dfs -ls /data/output1

Found 3 items
-rw-r--r--   1 srinath supergroup          0 2012-04-09 09:04 /data/output1/_SUCCESS
drwxr-xr-x   - srinath supergroup          0 2012-04-09 09:04 /data/output1/_logs
-rw-r--r--   1 srinath supergroup       1306 2012-04-09 09:04 /data/output1/part-r-00000

>bin/hadoop dfs -cat /data/output1/*

(BIS),  1
(ECCN)  1
(TSU)   1
(see    1
5D002.C.1,      1
740.13) 1
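The part-r-00000 file above holds tab-separated (word, count) pairs. The computation the job performs can be sketched as a standalone Java program (this is an illustration of the map/reduce logic only, not the source of the bundled hadoop-examples WordCount):

```java
import java.util.*;

public class WordCountSketch {
    // "Map" phase: tokenize each line and emit a (word, 1) pair per token.
    // "Reduce" phase: sum the counts per word (here the two phases are
    // merged into a single pass, since everything runs in one JVM).
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted, like the job output
        for (String line : lines) {
            for (String token : line.split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum); // reduce: sum per key
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to be");
        System.out.println(wordCount(lines)); // prints {be=3, not=1, or=1, to=3}
    }
}
```

In the real job, the map tasks run against HDFS blocks on different nodes, and Hadoop shuffles the (word, 1) pairs so that each reduce task receives all the pairs for a given word.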
Job submission to a distributed Hadoop cluster works similarly to job submission to a local Hadoop installation, as described in the Writing a WordCount MapReduce sample, bundling it and running it using standalone Hadoop recipe. However, there are two main differences.
First, Hadoop stores both the inputs for the job and the outputs generated by the job in the HDFS filesystem. Therefore, we use step 1 to store the inputs in the HDFS filesystem, and we use step 3 to read the outputs from the HDFS filesystem.
Second, when the job is submitted, the local Hadoop installation runs it as a single local JVM execution. In contrast, the distributed cluster submits the job to the JobTracker, which executes it using the nodes of the distributed Hadoop cluster.
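In Hadoop 1.x, which of these two paths the job client takes is controlled by the mapred.job.tracker property: its default value, local, makes jobs run inside a single local JVM, while a host:port value makes the client submit jobs to that JobTracker. A typical mapred-site.xml entry looks like the following sketch (the hostname and port are placeholders; use your own cluster's JobTracker address):

```xml
<configuration>
  <property>
    <!-- "local" (the default) runs jobs in-process; a host:port
         submits them to the JobTracker of the distributed cluster. -->
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>
</configuration>
```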