Another useful tool for analyzing data is the scatter plot. We use a scatter plot to find the relationship between two measurements (dimensions) by plotting one dimension against the other.
For example, this recipe analyzes the data to find the relationship between the size of the web pages and the number of hits each page receives.
The following figure shows a summary of the execution. Here, the mapper calculates and emits the message size (in 1024-byte blocks) as the key and one as the value. Then the reducer calculates the number of occurrences for each message size:
Export the HADOOP_HOME environment variable to refer to the Hadoop installation folder.

The following steps show how to use MapReduce to calculate the correlation between two datasets:

1. Obtain the weblog dataset (the NASA_access_log_Jul95 file) and place it in a local folder. We will call that folder DATA_DIR.
2. Upload the data to HDFS by running the following commands from HADOOP_HOME. If /data is already there, clean it up:
> bin/hadoop dfs -mkdir /data
> bin/hadoop dfs -mkdir /data/input1
> bin/hadoop dfs -put <DATA_DIR>/NASA_access_log_Jul95 /data/input1
3. Unzip the source code of this chapter (chapter6.zip). We will call that folder CHAPTER_6_SRC.
4. Change the hadoop.home property in the CHAPTER_6_SRC/build.xml file to point to your Hadoop installation folder.
5. Compile the source by running the ant build command from the CHAPTER_6_SRC folder.
6. Copy the build/lib/hadoop-cookbook-chapter6.jar file to your HADOOP_HOME.
7. Run the MapReduce job through the following command from HADOOP_HOME:
> bin/hadoop jar hadoop-cookbook-chapter6.jar chapter6.WeblogMessagesizevsHitsProcessor /data/input1 /data/output5
8. Read the results by running the following command:
> bin/hadoop dfs -cat /data/output5/*
9. Download the results to the local computer by running the following command from HADOOP_HOME:
> bin/hadoop dfs -get /data/output5/part-r-00000 5.data
10. Copy all the *.plot files from CHAPTER_6_SRC to HADOOP_HOME.
11. Generate the plot by running the following command from HADOOP_HOME:
> gnuplot httphitsvsmsgsize.plot
12. This generates a file called hitsbymsgSize.png, which will look like the following screenshot:

The plot shows a negative correlation between the number of hits and the size of the messages in the log scales, which also suggests a power law distribution.
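For reference, a gnuplot script for this kind of log-log scatter plot could look like the following minimal sketch. It assumes the reducer output fetched in step 9 is available as 5.data, with the message size (in 1024-byte blocks) in the first column and the hit count in the second; the actual httphitsvsmsgsize.plot file shipped with the chapter may differ:
# Minimal sketch; the chapter's httphitsvsmsgsize.plot may differ in details.
set terminal png
set output "hitsbymsgSize.png"
set logscale xy
set xlabel "Message size (KB)"
set ylabel "Number of hits"
plot "5.data" using 1:2 with points title "hits vs. message size"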
You can find the source for the recipe in src/chapter6/WeblogMessagesizevsHitsProcessor.java.
The following code segment shows the code for the mapper. Just as in the earlier recipes, we use regular expressions to parse the log entries from the log files:
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  // Parse the log line; lines that do not match the expected format are skipped
  Matcher matcher = httplogPattern.matcher(value.toString());
  if (matcher.matches()) {
    // Group 5 holds the response size in bytes; emit it in 1024-byte blocks
    int size = Integer.parseInt(matcher.group(5));
    context.write(new IntWritable(size / 1024), one);
  }
}
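The snippet refers to two fields of the mapper class that are not shown here: the compiled pattern httplogPattern and the constant one. Assuming the NASA access log follows the common log format and that the fifth capturing group holds the response size in bytes, their declarations could look roughly as follows (the exact regular expression in the chapter's source may differ):
// Hypothetical declarations (requires java.util.regex.Pattern and
// org.apache.hadoop.io.IntWritable); the chapter's source may use a different pattern.
// A log line looks like:
// 199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
// Group 5 captures the response size in bytes; lines whose size field is "-" do not
// match and are therefore skipped by the mapper.
private static final Pattern httplogPattern = Pattern.compile(
    "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+)$");
private static final IntWritable one = new IntWritable(1);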
The map task receives each line of the log file as a separate key-value pair. It parses the line using the regular expression and emits the file size, in 1024-byte blocks, as the key and one as the value.
Hadoop then collects all key-value pairs, sorts them, and invokes the reducer once for each key. Each reducer walks through its values and calculates the count of page accesses for each file size.
public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  // Sum the counts emitted for this message size
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  context.write(key, new IntWritable(sum));
}
The main() method of the job looks similar to those in the earlier recipes.
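As a rough sketch of what such a driver typically contains, assuming the mapper and reducer are inner classes of WeblogMessagesizevsHitsProcessor (the names AMapper and AReducer below are placeholders, and the actual main() method in the chapter's source may differ):
// Sketch of a typical driver for this job; AMapper and AReducer are placeholder names
// for the mapper and reducer classes shown above.
// Imports assumed at the top of the file:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "weblog-messagesize-vs-hits");
  job.setJarByClass(WeblogMessagesizevsHitsProcessor.class);
  job.setMapperClass(AMapper.class);
  job.setReducerClass(AReducer.class);
  // Both the output key (message size in KB) and value (hit count) are IntWritable
  job.setOutputKeyClass(IntWritable.class);
  job.setOutputValueClass(IntWritable.class);
  // Input and output paths come from the command line (/data/input1 and /data/output5)
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Calling job.waitForCompletion(true) submits the job and blocks until it finishes, printing progress to the console.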