This recipe describes how to run a job in a distributed cluster.
Now let us run the WordCount sample in the distributed Hadoop setup.
1. Copy the README.txt file in your Hadoop distribution to the HDFS filesystem at the location /data/input1:

>bin/hadoop dfs -mkdir /data/
>bin/hadoop dfs -mkdir /data/input1
>bin/hadoop dfs -put README.txt /data/input1/README.txt
>bin/hadoop dfs -ls /data/input1

Found 1 items
-rw-r--r--   1 srinath supergroup       1366 2012-04-09 08:59 /data/input1/README.txt
2. Run the WordCount sample by executing the following command from the HADOOP_HOME directory:

>bin/hadoop jar hadoop-examples-1.0.0.jar wordcount /data/input1 /data/output1

12/04/09 09:04:25 INFO input.FileInputFormat: Total input paths to process : 1
12/04/09 09:04:26 INFO mapred.JobClient: Running job: job_201204090847_0001
12/04/09 09:04:27 INFO mapred.JobClient:  map 0% reduce 0%
12/04/09 09:04:42 INFO mapred.JobClient:  map 100% reduce 0%
12/04/09 09:04:54 INFO mapred.JobClient:  map 100% reduce 100%
12/04/09 09:04:59 INFO mapred.JobClient: Job complete: job_201204090847_0001
.....
3. Run the following commands to list the output directory and to view the results:

>bin/hadoop dfs -ls /data/output1

Found 3 items
-rw-r--r--   1 srinath supergroup          0 2012-04-09 09:04 /data/output1/_SUCCESS
drwxr-xr-x   - srinath supergroup          0 2012-04-09 09:04 /data/output1/_logs
-rw-r--r--   1 srinath supergroup       1306 2012-04-09 09:04 /data/output1/part-r-00000

>bin/hadoop dfs -cat /data/output1/*

(BIS),  1
(ECCN)  1
(TSU)   1
(see    1
5D002.C.1,      1
740.13) 1
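The part-r-00000 file above holds tab-separated (word, count) pairs. The computation the job performs can be sketched as a standalone Java program (this is an illustration of the map/reduce logic only, not the source of the bundled hadoop-examples WordCount):

```java
import java.util.*;

public class WordCountSketch {
    // "Map" phase: tokenize each line and emit a (word, 1) pair per token.
    // "Reduce" phase: sum the counts per word (here the two phases are
    // merged into a single pass, since everything runs in one JVM).
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted, like the job output
        for (String line : lines) {
            for (String token : line.split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum); // reduce: sum per key
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to be");
        System.out.println(wordCount(lines)); // prints {be=3, not=1, or=1, to=3}
    }
}
```

In the real job, the map tasks run against HDFS blocks on different nodes, and Hadoop shuffles the (word, 1) pairs so that each reduce task receives all the pairs for a given word.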
Job submission to a distributed Hadoop cluster works similarly to job submission to a local Hadoop installation, as described in the Writing a WordCount MapReduce sample, bundling it and running it using standalone Hadoop recipe. However, there are two main differences.
First, Hadoop stores both the inputs for the job and the outputs generated by the job in the HDFS filesystem. Therefore, we use step 1 to store the inputs in the HDFS filesystem, and we use step 3 to read the outputs from the HDFS filesystem.
Second, when the job is submitted, the local Hadoop installation runs it as a single local JVM execution. In contrast, the distributed cluster submits the job to the JobTracker, which executes it using the nodes of the distributed Hadoop cluster.
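In Hadoop 1.x, which of these two paths the job client takes is controlled by the mapred.job.tracker property: its default value, local, makes jobs run inside a single local JVM, while a host:port value makes the client submit jobs to that JobTracker. A typical mapred-site.xml entry looks like the following sketch (the hostname and port are placeholders; use your own cluster's JobTracker address):

```xml
<configuration>
  <property>
    <!-- "local" (the default) runs jobs in-process; a host:port
         submits them to the JobTracker of the distributed cluster. -->
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>
</configuration>
```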