Running the WordCount program in a distributed cluster environment

This recipe describes how to run a job in a distributed cluster.

Getting ready

Start the Hadoop cluster.

How to do it...

Now let us run the WordCount sample in the distributed Hadoop setup.

  1. Copy the README.txt file from your Hadoop distribution to the HDFS location /data/input1; it will serve as the input for the WordCount MapReduce sample that we wrote in the earlier recipe. (A programmatic alternative to these commands is sketched just after this list.)
    >bin/hadoop dfs -mkdir /data/
    >bin/hadoop dfs -mkdir /data/input1
    >bin/hadoop dfs -put README.txt /data/input1/README.txt
    >bin/hadoop dfs -ls /data/input1
    
    Found 1 items
    -rw-r--r--   1 srinath supergroup       1366 2012-04-09 08:59 /data/input1/README.txt
    
  2. Now, let's run the WordCount example from the HADOOP_HOME directory.
    >bin/hadoop jar hadoop-examples-1.0.0.jar wordcount /data/input1 /data/output1
    
    12/04/09 09:04:25 INFO input.FileInputFormat: Total input paths to process : 1
    12/04/09 09:04:26 INFO mapred.JobClient: Running job: job_201204090847_0001
    12/04/09 09:04:27 INFO mapred.JobClient:  map 0% reduce 0%
    12/04/09 09:04:42 INFO mapred.JobClient:  map 100% reduce 0%
    12/04/09 09:04:54 INFO mapred.JobClient:  map 100% reduce 100%
    12/04/09 09:04:59 INFO mapred.JobClient: Job complete: job_201204090847_0001
    .....
    
  3. Run the following commands to list the output directory and then look at the results.
    >bin/hadoop dfs -ls /data/output1
    
    Found 3 items
    -rw-r--r--   1 srinath supergroup          0 2012-04-09 09:04 /data/output1/_SUCCESS
    drwxr-xr-x   - srinath supergroup          0 2012-04-09 09:04 /data/output1/_logs
    -rw-r--r--   1 srinath supergroup       1306 2012-04-09 09:04 /data/output1/part-r-00000
    
    >bin/hadoop dfs -cat /data/output1/*
    
    (BIS),  1
    (ECCN)	  1
    (TSU)  1
    (see  1
    5D002.C.1,  1
    740.13)  1
    
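If you prefer to stage the input from Java code rather than the command line, the following minimal sketch uses the Hadoop FileSystem API to do the same work as the commands in step 1. It is not part of the recipe's sample code: the class name StageWordCountInput is made up for illustration, and the sketch assumes your Hadoop configuration files (core-site.xml pointing at the NameNode) are on the classpath and that README.txt sits in the directory you run it from.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StageWordCountInput {
      public static void main(String[] args) throws Exception {
        // Picks up fs.default.name from core-site.xml on the classpath,
        // so the paths below resolve against HDFS, not the local disk.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: bin/hadoop dfs -mkdir /data/input1
        fs.mkdirs(new Path("/data/input1"));

        // Equivalent of: bin/hadoop dfs -put README.txt /data/input1/README.txt
        fs.copyFromLocalFile(new Path("README.txt"),
            new Path("/data/input1/README.txt"));
        fs.close();
      }
    }
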

How it works...

Submitting a job to a distributed Hadoop cluster works in a similar way to submitting a job to a local Hadoop installation, as described in the Writing a WordCount MapReduce sample, bundling it and running it using standalone Hadoop recipe. However, there are two main differences.

First, Hadoop stores both the input to a job and the output it generates in the HDFS filesystem. Therefore, in step 1 we store the input in HDFS, and in step 3 we read the output from HDFS.
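To make this concrete, the driver of the WordCount sample points the job at those HDFS locations. The following is a minimal sketch of such a driver based on the standard WordCount example; the names WordCountDriver, TokenizerMapper, and IntSumReducer are assumptions and may differ from the classes you wrote in the earlier recipe.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);   // mapper from the earlier recipe
        job.setReducerClass(IntSumReducer.class);    // reducer from the earlier recipe
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output are HDFS paths, for example /data/input1 and
        // /data/output1; the output directory must not already exist.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
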

Second, when a job is submitted, a local Hadoop installation runs it within a single local JVM. In a distributed setup, however, the job is submitted to the JobTracker, which executes it across the nodes of the distributed Hadoop cluster.
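The choice between these two execution modes is made by the Hadoop configuration, not by the job code. The fragment below illustrates the two Hadoop 1.x properties involved; the host names namenode and jobtracker and the ports are placeholders, and in a real cluster these values normally come from conf/core-site.xml and conf/mapred-site.xml rather than being set in the driver.

    Configuration conf = new Configuration();
    // The defaults (file:/// and "local") run the job in a single local JVM.
    // Pointing these at the cluster makes the driver read and write HDFS and
    // submit the job to the JobTracker instead.
    conf.set("fs.default.name", "hdfs://namenode:9000");
    conf.set("mapred.job.tracker", "jobtracker:9001");
    Job job = new Job(conf, "wordcount");
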

There's more...

You can also browse the results of the WordCount application through the HDFS monitoring UI, as described in the Using HDFS monitoring UI recipe, and you can view statistics about the WordCount job as explained in the next recipe, Using MapReduce Monitoring UI.
