This recipe shows how we can use MapReduce to group data into simple groups and calculate analytics for each group. We will use the same HTTP log dataset. The following figure shows a summary of the execution:
As shown in the figure, the mapper task groups the occurrences of each link under different keys. Hadoop then sorts the keys and provides all values for a given key to a reducer, which counts the number of occurrences.
We will use HADOOP_HOME to refer to the Hadoop installation folder.

The following steps show how we can group weblog data and calculate analytics:

1. Download and extract the HTTP log dataset used in the earlier recipe. We will call the extracted folder DATA_DIR.
2. Upload the data to HDFS by running the following commands from HADOOP_HOME. If /data is already there, clean it up:

>bin/hadoop dfs -mkdir /data
>bin/hadoop dfs -mkdir /data/input1
>bin/hadoop dfs -put <DATA_DIR>/NASA_access_log_Jul95 /data/input1

3. Unzip the source code of this chapter (chapter6.zip). We will call that folder CHAPTER_6_SRC.
4. Change the hadoop.home property in the CHAPTER_6_SRC/build.xml file to point to your Hadoop installation folder.
5. Compile the source by running the ant build command from the CHAPTER_6_SRC folder.
6. Copy build/lib/hadoop-cookbook-chapter6.jar to HADOOP_HOME.
7. Run the MapReduce job through the following command from HADOOP_HOME:

>bin/hadoop jar hadoop-cookbook-chapter6.jar chapter6.WeblogHitsByLinkProcessor /data/input1 /data/output2

8. Read the results by running the following command:

>bin/hadoop dfs -cat /data/output2/*
You will see that it prints the results as follows:

/base-ops/procurement/procurement.html 28
/biomed/ 1
/biomed/bibliography/biblio.html 7
/biomed/climate/airqual.html 4
/biomed/climate/climate.html 5
/biomed/climate/gif/f16pcfinmed.gif 4
/biomed/climate/gif/f22pcfinmed.gif 3
/biomed/climate/gif/f23pcfinmed.gif 3
/biomed/climate/gif/ozonehrlyfin.gif 3
You can find the source for this recipe in src/chapter6/WeblogHitsByLinkProcessor.java.
As described in the earlier recipe, we will use regular expressions to parse HTTP logs. In the following sample log line, /shuttle/countdown/countdown.html is the link (URL) being retrieved:
205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985
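The actual pattern (httplogPattern) is defined in the source file mentioned above. As a minimal sketch, a pattern along the following lines would parse this log format, with capture group 4 holding the request URL as the mapper below expects. The pattern and the class name HttpLogParseSketch shown here are assumptions for illustration, not necessarily what the recipe's code uses:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HttpLogParseSketch {
    // Hypothetical pattern: group 1 is the host, group 2 the timestamp,
    // group 3 the HTTP method, group 4 the URL, group 5 the status code,
    // and group 6 the response size.
    static final Pattern httplogPattern = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\S+)");

    public static void main(String[] args) {
        String line = "205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "
            + "\"GET /shuttle/countdown/countdown.html HTTP/1.0\" 200 3985";
        Matcher matcher = httplogPattern.matcher(line);
        if (matcher.matches()) {
            // Prints /shuttle/countdown/countdown.html
            System.out.println(matcher.group(4));
        }
    }
}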
The following code segment shows the mapper:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    Matcher matcher = httplogPattern.matcher(value.toString());
    if (matcher.matches()) {
        String linkUrl = matcher.group(4);
        word.set(linkUrl);
        context.write(word, one);
    }
}
The map task receives each line in the log file as a separate key-value pair. It parses the line using a regular expression and emits the link as the key and the number one as the value.
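The word and one variables referenced in the map() method are fields of the enclosing mapper class. A minimal sketch of how that class might be declared follows; the class name AMapper is an assumption:

public static class AMapper extends Mapper<Object, Text, Text, IntWritable> {
    // Reused output objects: the constant count 1 and the link key.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // The map() method shown above goes here.
}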
Then, Hadoop collects all values emitted for each key (link) and invokes the reducer once per link. Each reducer invocation counts the number of hits for that link.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
}
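Since this summation is associative and commutative, the same reducer class can typically also be registered as a combiner, so counts are pre-aggregated on the map side before the shuffle. Assuming the reducer class is named AReducer (a hypothetical name), the job setup would add:

job.setCombinerClass(AReducer.class);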
The main() method of the job works similarly to that of the earlier recipe.
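The driver is not reproduced here, but a minimal sketch of a typical main() method for such a job, assuming the hypothetical AMapper and AReducer class names from above (the job name string is likewise arbitrary), would look like the following:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "weblog-hits-by-link");
    job.setJarByClass(WeblogHitsByLinkProcessor.class);
    job.setMapperClass(AMapper.class);
    // Optionally pre-aggregate counts on the map side, as discussed above.
    job.setCombinerClass(AReducer.class);
    job.setReducerClass(AReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output paths come from the command-line arguments.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}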