Calculating scatter plots using MapReduce

Another useful tool for analyzing data is the scatter plot. We use a scatter plot to find the relationship between two measurements (dimensions) by plotting one against the other.

As an example, this recipe analyzes the data to find the relationship between the size of web pages and the number of hits each page receives.

The following figure shows a summary of the execution. Here, the mapper calculates and emits the message size (rounded down to 1,024-byte blocks) as the key and one as the value. Then the reducer calculates the number of occurrences for each message size:


Getting ready

  • This recipe assumes that you have followed the first chapter and have installed Hadoop. We will use the HADOOP_HOME variable to refer to the Hadoop installation folder.
  • Start Hadoop by following the instructions in the first chapter.
  • This recipe assumes you are aware of how Hadoop processing works. If you have not already done so, you should follow the recipe Writing a WordCount MapReduce sample, bundling it and running it using standalone Hadoop from Chapter 1, Getting Hadoop Up and Running in a Cluster.

How to do it...

The following steps show how to use MapReduce to calculate the data for the scatter plot of message size versus the number of hits:

  1. Download the weblog dataset from ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz and unzip it. We will call this DATA_DIR.
  2. Upload the data to HDFS by running the following commands from HADOOP_HOME. If /data is already there, clean it up:
    > bin/hadoop dfs -mkdir /data
    > bin/hadoop dfs -mkdir /data/input1
    > bin/hadoop dfs -put <DATA_DIR>/NASA_access_log_Jul95 /data/input1
    
  3. Unzip the source code of this chapter (chapter6.zip). We will call that folder CHAPTER_6_SRC.
  4. Change the hadoop.home property in the CHAPTER_6_SRC/build.xml file to point to your Hadoop installation folder.
  5. Compile the source by running the ant build command from the CHAPTER_6_SRC folder.
  6. Copy the build/lib/hadoop-cookbook-chapter6.jar file to your HADOOP_HOME.
  7. Run the MapReduce job through the following command from HADOOP_HOME:
    > bin/hadoop jar hadoop-cookbook-chapter6.jar chapter6.WeblogMessagesizevsHitsProcessor /data/input1 /data/output5
    
  8. Read the results by running the following command:
    > bin/hadoop dfs -cat /data/output5/*
    
  9. Download the results to the local computer by running the following command from HADOOP_HOME:
    > bin/hadoop dfs -get /data/output5/part-r-00000 5.data
    
  10. Copy all the *.plot files from CHAPTER_6_SRC to HADOOP_HOME.
  11. Generate the plot by running the following command from HADOOP_HOME:
    > gnuplot httphitsvsmsgsize.plot
    
  12. It will generate a file called hitsbymsgSize.png, which will look like the following screenshot:

The plot shows a negative correlation between the number of hits and the size of the messages on log scales, which also suggests a power-law distribution.

How it works...

You can find the source for this recipe at src/chapter6/WeblogMessagesizevsHitsProcessor.java.

The following code segment shows the code for the mapper. As in the earlier recipes, we use regular expressions to parse the log entries from the log files:

public void map(Object key, Text value,
  Context context) throws IOException, InterruptedException
{
  // Parse the log line; lines that do not match the access-log format are skipped
  Matcher matcher = httplogPattern.matcher(value.toString());
  if (matcher.matches())
  {
    // Group 5 holds the response size in bytes; emit it in 1,024-byte blocks
    int size = Integer.parseInt(matcher.group(5));
    context.write(new IntWritable(size / 1024), one);
  }
}

The map task receives each line in the log file as a separate key-value pair. It parses the lines using regular expressions and emits the file size, in 1,024-byte blocks, as the key and one as the value.
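
The mapper depends on two fields defined in the enclosing class: httplogPattern and one. Their exact definitions are in the chapter's source code; the following is only a sketch of how they might be declared, assuming a regular expression for the NASA access-log format in which the fifth capture group holds the response size in bytes:

// Assumed field declarations (the actual regular expression ships with the
// chapter's source); uses java.util.regex.Pattern and org.apache.hadoop.io.IntWritable.
// A log line of the form
//   host - - [01/Jul/1995:00:00:01 -0400] "GET /path/page.html HTTP/1.0" 200 6245
// is matched so that matcher.group(5) captures the response size in bytes.
public static final Pattern httplogPattern = Pattern.compile(
  "([^\\s]+) - - \\[(.+)\\] \"([^\\s]+) (/[^\\s]*) HTTP/[^\\s]+\" [^\\s]+ ([0-9]+)");
private static final IntWritable one = new IntWritable(1);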

Then, Hadoop collects all key-value pairs, sorts them, and then invokes the reducer once for each key. Each reducer walks through the values and calculates the count of page accesses for each file size.

public void reduce(IntWritable key, Iterable<IntWritable> values,
  Context context) throws IOException, InterruptedException
{
  // Sum the ones emitted by the mappers to get the hit count for this message size
  int sum = 0;
  for (IntWritable val : values)
  {
    sum += val.get();
  }
  context.write(key, new IntWritable(sum));
}

The main() method of the job looks similar to the earlier recipes.
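
As a reference, here is a minimal sketch of such a driver, assuming the standard new-API job setup; the mapper and reducer class names (AMapper, AReducer) are placeholders rather than the names used in the chapter's source, and the usual org.apache.hadoop imports are omitted:

public static void main(String[] args) throws Exception
{
  Configuration conf = new Configuration();
  Job job = new Job(conf, "weblog-messagesize-vs-hits");
  job.setJarByClass(WeblogMessagesizevsHitsProcessor.class);
  // AMapper and AReducer are placeholder names for the mapper and reducer shown above
  job.setMapperClass(AMapper.class);
  job.setReducerClass(AReducer.class);
  // The key (message size in 1,024-byte blocks) and the value (hit count) are both integers
  job.setOutputKeyClass(IntWritable.class);
  job.setOutputValueClass(IntWritable.class);
  // args[0] and args[1] correspond to /data/input1 and /data/output5 in step 7
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}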
