Calculating histograms using MapReduce

Another interesting view into a dataset is a histogram. A histogram makes sense only for a continuous dimension (for example, access time or file size). It groups the number of occurrences of an event into a set of buckets along that dimension. For example, in this recipe we take the access time from the weblogs as the dimension and group the accesses by the hour of the day.

The following figure shows a summary of the execution. Here, the mapper calculates the hour of the day and emits the hour of the day and 1 as the key and value, respectively. Then each reducer receives all the occurrences for one hour of the day and calculates the number of occurrences:

[Figure: Calculating histograms using MapReduce — mappers emit the hour of the day with the value 1, and each reducer sums the counts for one hour]

Getting ready

  • This recipe assumes that you have followed the first chapter and have installed Hadoop. We will use the HADOOP_HOME variable to refer to the Hadoop installation folder.
  • Start Hadoop by following the instructions in the first chapter.
  • This recipe assumes that you are aware of how Hadoop processing works. If you have not already done so, you should follow the recipe Writing a WordCount MapReduce sample, bundling it and running it using standalone Hadoop from Chapter 1, Getting Hadoop Up and Running in a Cluster.

How to do it...

The following steps show how to calculate and plot the histogram:

  1. Download the weblog dataset from ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz and extract it. We will call the folder that contains the extracted file DATA_DIR.
  2. Upload the data to HDFS by running the following commands from HADOOP_HOME. If /data is already there, clean it up first:
    >bin/hadoop dfs -mkdir /data
    >bin/hadoop dfs -mkdir /data/input1
    >bin/hadoop dfs -put <DATA_DIR>/NASA_access_log_Jul95 /data/input1
    
  3. Unzip the source code of this chapter (chapter6.zip). We will call that folder CHAPTER_6_SRC.
  4. Change the hadoop.home property in the CHAPTER_6_SRC/build.xml file to point to your Hadoop installation folder.
  5. Compile the source by running the ant build command from the CHAPTER_6_SRC folder.
  6. Copy the build/lib/hadoop-cookbook-chapter6.jar file to HADOOP_HOME.
  7. Run the MapReduce job through the following command from HADOOP_HOME:
    > bin/hadoop jar hadoop-cookbook-chapter6.jar chapter6.WeblogTimeOfDayHistogramCreator /data/input1 /data/output4
    
  8. Read the results by running the following command:
    >bin/hadoop dfs -cat /data/output4/*
    
  9. Download the results to the local computer by running the following command from HADOOP_HOME:
    > bin/hadoop dfs -get /data/output4/part-r-00000 3.data
    
  10. Copy all the *.plot files from CHAPTER_6_SRC to HADOOP_HOME.
  11. Generate the plot by running the following command from HADOOP_HOME:
    >gnuplot httphistbyhour.plot
    
  12. It will generate a file called hitsbyHour.png, which will look like the following:
    [Figure: hitsbyHour.png — histogram of web page hits by hour of the day]

As you can see from the figure, most of the access to NASA happens at night, whereas there is a dip around noontime. Also, the two peaks roughly follow the tea times.

How it works...

You can find the source for this recipe in src/chapter6/WeblogTimeOfDayHistogramCreator.java. As explained in the first recipe of this chapter, we use a regular expression to parse the log file and extract the access time from each log entry.

The following code segment shows the mapper function:

public void map(Object key, Text value, 
  Context context) throws IOException, InterruptedException
{
  Matcher matcher = httplogPattern.matcher(value.toString());
  if (matcher.matches())
  {
    try
    {
      // Extract the timestamp field from the matched log line
      String timeAsStr = matcher.group(2);
      Date time = dateFormatter.parse(timeAsStr);
      // Convert the timestamp to the hour of the day (0-23)
      Calendar calendar = GregorianCalendar.getInstance();
      calendar.setTime(time);
      int hours = calendar.get(Calendar.HOUR_OF_DAY);
      // Emit (hour of the day, 1) so that the reducer can count
      // the page accesses that fall within each hour
      context.write(new IntWritable(hours), one);
    }
    catch (ParseException e)
    {
      // Skip log lines whose timestamp cannot be parsed
    }
  }
}
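
The mapper relies on a few helper fields (httplogPattern, dateFormatter, and the constant one) defined in the enclosing class. The following is a minimal sketch of how they might be declared, assuming an Apache Common Log Format input such as the NASA weblog, a pattern whose second capture group holds the timestamp, and the usual java.util.regex, java.text, and Hadoop Writable imports; the exact declarations in the book's source may differ:

// Sketch only: the actual declarations in WeblogTimeOfDayHistogramCreator.java may differ.
// A Common Log Format line looks roughly like:
// 199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
public static final Pattern httplogPattern = Pattern.compile(
    "(\\S+) \\S+ \\S+ \\[(.+?)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");
// Parses timestamps such as 01/Jul/1995:00:00:01 -0400
public static final SimpleDateFormat dateFormatter =
    new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
// Reusable constant value emitted by the mapper
private static final IntWritable one = new IntWritable(1);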

The map task receives each line of the log file as a separate key-value pair. It parses the line using the regular expression and extracts the access time of the web page access. The mapper function then extracts the hour of the day from the access time and emits the hour of the day and one as its output.

Then, Hadoop collects all the key-value pairs, sorts them, and invokes the reducer once for each key. Each reducer walks through its values and calculates the count of page accesses for that hour.

public void reduce(IntWritable key,
  Iterable<IntWritable> values,
  Context context) throws IOException, InterruptedException
{
  int sum = 0;
  // Sum up the 1s emitted by the mappers for this hour of the day
  for (IntWritable val : values) 
  {
    sum += val.get();
  }
  // Emit (hour of the day, total number of accesses in that hour)
  context.write(key, new IntWritable(sum));
}

The main() method of the job looks similar to that of the WordCount example described in the earlier recipe.
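
For reference, the following is a minimal sketch of what such a WordCount-style driver might look like. It assumes the Hadoop 1.x mapreduce API used throughout this book, and the mapper and reducer class names (AMapper and AReducer) are placeholders rather than the names used in the book's source code:

// Sketch of a WordCount-style driver for this recipe (AMapper and AReducer are placeholders)
public static void main(String[] args) throws Exception
{
  Configuration conf = new Configuration();
  Job job = new Job(conf, "web-log-hour-histogram");
  job.setJarByClass(WeblogTimeOfDayHistogramCreator.class);
  job.setMapperClass(AMapper.class);          // the mapper shown above
  job.setReducerClass(AReducer.class);        // the reducer shown above
  job.setCombinerClass(AReducer.class);       // a summing reducer also works as a combiner
  job.setOutputKeyClass(IntWritable.class);   // hour of the day
  job.setOutputValueClass(IntWritable.class); // number of accesses
  FileInputFormat.addInputPath(job, new Path(args[0]));   // for example, /data/input1
  FileOutputFormat.setOutputPath(job, new Path(args[1])); // for example, /data/output4
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}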
