Performing Group-By using MapReduce

This recipe shows how we can use MapReduce to group data into simple groups and calculate the analytics for each group. We will use the same HTTP log dataset. The following figure shows a summary of the execution:

Performing Group-By using MapReduce

As the figure shows, the mapper emits each occurrence of a link under the link as the key. Hadoop then sorts and groups the keys, and provides all the values for a given key to a reducer, which counts the number of occurrences.
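For example, two of the links that appear in the sample output later in this recipe flow through the job as shown below (only these two keys are shown, and the ordering of the mapper output is illustrative):

    Mapper output (one pair per matching log line):
    (/biomed/climate/airqual.html, 1)
    (/biomed/, 1)
    (/biomed/climate/airqual.html, 1)
    (/biomed/climate/airqual.html, 1)
    (/biomed/climate/airqual.html, 1)

    Reducer input (values grouped and sorted by key):
    (/biomed/, [1])
    (/biomed/climate/airqual.html, [1, 1, 1, 1])

    Reducer output:
    (/biomed/, 1)
    (/biomed/climate/airqual.html, 4)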

Getting ready

  • This recipe assumes that you have followed the first chapter and have installed Hadoop. We will use the HADOOP_HOME to refer to the Hadoop installation folder.
  • Start Hadoop following the instructions in the first chapter.
  • This recipe assumes that you are aware of how Hadoop processing works. If you have not already done so, you should follow the recipe Writing a WordCount MapReduce sample, bundling it and running it using standalone Hadoop from Chapter 1, Getting Hadoop Up and Running in a Cluster.

How to do it...

The following steps show how we can group weblog data and calculate analytics.

  1. Download the weblog dataset from ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz and unzip it. We will refer to the directory that contains the unzipped file as DATA_DIR.
  2. Upload the data to HDFS by running the following commands from HADOOP_HOME. If /data is already there, clean it up:
    >bin/hadoop dfs -mkdir /data
    >bin/hadoop dfs -mkdir /data/input1
    >bin/hadoop dfs -put <DATA_DIR>/NASA_access_log_Jul95 /data/input1
    
  3. Unzip the source code of this chapter (chapter6.zip). We will call that folder CHAPTER_6_SRC.
  4. Change the hadoop.home property in the CHAPTER_6_SRC/build.xml file to point to your Hadoop installation folder.
  5. Compile the source by running the ant build command from the CHAPTER_6_SRC folder.
  6. Copy the build/lib/hadoop-cookbook-chapter6.jar to HADOOP_HOME.
  7. Run the MapReduce job using the following command from HADOOP_HOME:
    >bin/hadoop jar hadoop-cookbook-chapter6.jar chapter6.WeblogHitsByLinkProcessor /data/input1 /data/output2
    
  8. Read the results by running the following command:
    >bin/hadoop dfs -cat /data/output2/*
    

    You will see results similar to the following:

    /base-ops/procurement/procurement.html  28
    /biomed/                                1
    /biomed/bibliography/biblio.html        7
    /biomed/climate/airqual.html            4
    /biomed/climate/climate.html            5
    /biomed/climate/gif/f16pcfinmed.gif     4
    /biomed/climate/gif/f22pcfinmed.gif     3
    /biomed/climate/gif/f23pcfinmed.gif     3
    /biomed/climate/gif/ozonehrlyfin.gif    3
    
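If you would rather inspect the results with a local text editor, you can also copy the output directory out of HDFS; <LOCAL_DIR> below stands for any local directory of your choice:

    >bin/hadoop dfs -get /data/output2 <LOCAL_DIR>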

How it works...

You can find the source for this recipe in src/chapter6/WeblogHitsByLinkProcessor.java.

As described in the earlier recipe, we use regular expressions to parse the HTTP logs. In the following sample log line, /shuttle/countdown/countdown.html is the link (URL) being retrieved.

205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985
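The pattern itself is defined in WeblogHitsByLinkProcessor.java. A minimal sketch of such a pattern, assuming the NASA log follows the Apache Common Log Format, could look like the following; here group 4 captures the requested URL, which is what the mapper extracts:

// Illustrative java.util.regex pattern for Common Log Format lines such as
// the sample above; the field name and exact expression in the recipe's
// source may differ.
private static final Pattern httplogPattern = Pattern.compile(
    "([^\\s]+) - - \\[(.+)\\] \"([^\\s]+) ([^\\s]+) ([^\\s]+)\" ([0-9]+) (.+)");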

The following code segment shows the mapper:

public void map(Object key, Text value,
  Context context) throws IOException,
  InterruptedException
{
  // Each value is one line of the log file
  Matcher matcher = httplogPattern.matcher(value.toString());
  if (matcher.matches())
  {
    // Group 4 of the pattern holds the requested link (URL)
    String linkUrl = matcher.group(4);
    word.set(linkUrl);
    // Emit the link as the key with a count of one
    context.write(word, one);
  }
}

The map task receives each line of the log file as a separate key-value pair. It parses the line using the regular expression and emits the link as the key, with the number one as the value.
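The word and one fields used in the mapper are reusable Writable objects declared on the mapper class. Their declarations are not shown in the snippet above; a minimal sketch, assuming the usual WordCount-style names, is:

// Reusable Writable instances, created once per mapper rather than per record
// (assumed declarations; see the recipe's source for the actual names)
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();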

Hadoop then collects all the values emitted for each key (link) and invokes the reducer once per link. Each reducer invocation counts the number of hits for that link.

public void reduce(Text key, Iterable<IntWritable> values,
  Context context) throws IOException, InterruptedException
{
  // Sum up all the counts emitted for this link
  int sum = 0;
  for (IntWritable val : values)
  {
    sum += val.get();
  }
  result.set(sum);
  // Emit the link together with its total number of hits
  context.write(key, result);
}

The main() method of the job works similarly to the one in the earlier recipe.
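If you do not have the earlier recipe at hand, the driver follows the standard Hadoop job setup. A minimal sketch is shown below; the mapper and reducer class names (AMapper, AReducer) and the job name are illustrative, so check WeblogHitsByLinkProcessor.java for the actual names:

public static void main(String[] args) throws Exception
{
  Configuration conf = new Configuration();
  Job job = new Job(conf, "weblog-hits-by-link");
  job.setJarByClass(WeblogHitsByLinkProcessor.class);
  job.setMapperClass(AMapper.class);
  // Summing counts is associative, so the reducer can also act as a combiner
  job.setCombinerClass(AReducer.class);
  job.setReducerClass(AReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}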
