Adding the combiner step to the WordCount MapReduce program

After the map phase, if there are many key-value pairs with the same key, Hadoop has to move all of those values to the reduce function, which can incur a significant overhead. To optimize such scenarios, Hadoop supports a special function called a combiner. If one is provided, Hadoop calls the combiner on the same node as the mapper, after the mapper has run and before the reducer is invoked. This can significantly reduce the amount of data transferred to the reduce step.

This recipe explains how to use the combiner with the WordCount sample introduced in the previous recipe.

How to do it...

Now let us run the MapReduce job with the combiner added:

  1. The combiner must have the same interface as the reduce function. For the WordCount sample, we will reuse the reduce function as the combiner.
  2. To ask the MapReduce job to use the combiner, let us uncomment the line //job.setCombinerClass(IntSumReducer.class); in the sample and recompile the code.
  3. Copy the hadoop-cookbook-chapter1.jar file to the HADOOP_HOME directory and run the WordCount as done in the earlier recipe. Make sure to delete the old output directory before running the job.
  4. The final results will be available in the output directory.
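The effect of the combiner can be sketched in plain Java. This is a conceptual simulation, not the actual Hadoop API: the `sum` function below plays the role that `IntSumReducer` plays in the sample, serving as both the combiner and the reducer, and `combine` mimics the local pre-aggregation Hadoop performs on the map node.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // The same logic WordCount uses for both the combiner and the
    // reducer: sum all the counts seen for a key.
    static int sum(List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    // Group (word, count) pairs by word, then collapse each group with
    // sum() -- this is what the combiner does locally on the map node.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            out.put(e.getKey(), sum(e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Map output for "the cat sat on the mat": one (word, 1) pair per token.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String w : "the cat sat on the mat".split(" ")) {
            mapOutput.add(Map.entry(w, 1));
        }
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println("pairs before combine: " + mapOutput.size()); // 6
        System.out.println("pairs after combine:  " + combined.size());  // 5
        System.out.println("count for 'the': " + combined.get("the"));   // 2
    }
}
```

Only the combined pairs would be shipped across the network to the reduce task, which then applies the same summing logic once more to merge the partial counts from all map nodes.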

How it works...

To activate a combiner, users should provide a mapper, a reducer, and a combiner as input to the MapReduce job. In that setting, Hadoop executes the combiner on the same node as the mapper, just after the mapper has run. With this approach, the combiner can pre-process the data generated by the mapper before it is sent to the reducer, thus reducing the amount of data that gets transferred.

For example, in WordCount, the combiner receives (word, 1) pairs from the map step as input and outputs a single (word, N) pair for each word. If an input document has 10,000 occurrences of the word "the", the mapper will generate 10,000 (the, 1) pairs, while the combiner will generate a single (the, 10000) pair, thus reducing the amount of data transferred to the reduce task.

However, the combiner only works with commutative and associative functions. For example, the same idea does not work when calculating the mean: averaging a set of partial averages does not, in general, give the overall average, so using the reduce function as a combiner in that case will yield a wrong result.
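A short arithmetic sketch (plain Java, not Hadoop code) shows why. Suppose one map node holds the values 1, 2, and 3 while another holds only 4: averaging the two partial means gives a different answer than the true mean, because the partial means carry no information about how many values each one summarizes.

```java
public class MeanCombinerPitfall {
    static double mean(double[] values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.length;
    }

    public static void main(String[] args) {
        // True mean over all four values: (1 + 2 + 3 + 4) / 4 = 2.5
        double trueMean = mean(new double[] {1, 2, 3, 4});

        // A naive "mean combiner": average each node's values locally,
        // then average the partial means at the reducer.
        double partial1 = mean(new double[] {1, 2, 3}); // 2.0, from 3 values
        double partial2 = mean(new double[] {4});       // 4.0, from 1 value
        double combinedMean = mean(new double[] {partial1, partial2});

        System.out.println("true mean:     " + trueMean);     // 2.5
        System.out.println("combined mean: " + combinedMean); // 3.0 -- wrong!
    }
}
```

The standard workaround is to make the combiner emit (sum, count) pairs, which are associative and can be merged safely, and to compute the final division only in the reducer.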

There's more...

Although in the sample we reused the reduce function implementation as the combiner function, you may write your own combiner function just like we did for the map and reduce functions in the previous recipe. However, the signature of the combiner function must be identical to that of the reduce function.

In a local setup, using a combiner will not yield significant gains. However, in a distributed setup, as described in the Setting Hadoop in a distributed cluster environment recipe, a combiner can give significant gains.
