Another interesting view into a dataset is a histogram. A histogram makes sense only over a continuous dimension (for example, access time or file size). It groups the occurrences of an event into several buckets along that dimension. For example, in this recipe, if we take the access time from the weblogs as the dimension, we group the accesses by the hour of the day.
The following figure shows a summary of the execution. Here the mapper calculates the hour of the day and emits the hour of the day and 1 as the key and value, respectively. Each reducer then receives all the occurrences for one hour of the day and calculates the number of occurrences.
This recipe uses the HADOOP_HOME variable to refer to the Hadoop installation folder and the DATA_DIR variable to refer to the folder that holds the weblog dataset. The following steps show how to calculate and plot a histogram:

1. Upload the weblog dataset to HDFS by running the following commands from HADOOP_HOME. If /data is already there, clean it up:
   > bin/hadoop dfs -mkdir /data
   > bin/hadoop dfs -mkdir /data/input1
   > bin/hadoop dfs -put <DATA_DIR>/NASA_access_log_Jul95 /data/input1
2. Unzip the source code for this chapter (chapter6.zip). We will call that folder CHAPTER_6_SRC.
3. Change the hadoop.home property in the CHAPTER_6_SRC/build.xml file to point to your Hadoop installation folder.
4. Compile the source by running the ant build command from the CHAPTER_6_SRC folder.
5. Copy the build/lib/hadoop-cookbook-chapter6.jar file to HADOOP_HOME.
6. Run the MapReduce job by issuing the following command from HADOOP_HOME:
   > bin/hadoop jar hadoop-cookbook-chapter6.jar chapter6.WeblogTimeOfDayHistogramCreator /data/input1 /data/output4
7. Read the results by running the following command:
   > bin/hadoop dfs -cat /data/output4/*
8. Download the results to the local machine by running the following command from HADOOP_HOME:
   > bin/hadoop dfs -get /data/output4/part-r-00000 3.data
9. Copy all the *.plot files from CHAPTER_6_SRC to HADOOP_HOME.
10. Generate the plot by running the following command from HADOOP_HOME:
    > gnuplot httphistbyhour.plot
11. This will generate a file called hitsbyHour.png, which will look like the following:

[Figure: hitsbyHour.png, the number of hits against the hour of the day]

As you can see from the figure, most of the access to NASA is at night, whereas there is a drop around noontime. Also, the two peaks roughly follow the tea times.
You can find the source for this recipe in src/chapter6/WeblogTimeOfDayHistogramCreator.java. As explained in the first recipe of this chapter, we use regular expressions to parse the log file and extract the access time from each log entry.
The following code segment shows the mapper function:
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  Matcher matcher = httplogPattern.matcher(value.toString());
  if (matcher.matches()) {
    String timeAsStr = matcher.group(2);
    try {
      Date time = dateFormatter.parse(timeAsStr);
      Calendar calendar = GregorianCalendar.getInstance();
      calendar.setTime(time);
      int hours = calendar.get(Calendar.HOUR_OF_DAY);
      context.write(new IntWritable(hours), one);
    } catch (ParseException e) {
      // Skip log entries whose timestamp cannot be parsed
    }
  }
}
The map task receives each line in the log file as a separate key-value pair. It parses the line using the regular expression and extracts the access time of the web page access. The mapper function then extracts the hour of the day from the access time and emits the hour of the day and one as its output.
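The mapper above refers to the fields httplogPattern, dateFormatter, and one, which are defined elsewhere in the class. The following is a minimal, illustrative sketch of how they could be defined for the NASA common-log-format dataset; the exact pattern in the recipe's source may differ:

// Illustrative field definitions (assumed, not copied from the recipe's source).
// Requires java.util.regex.Pattern, java.text.SimpleDateFormat, and java.util.Locale.
// group(1) = host, group(2) = timestamp, group(3) = request, group(4) = status, group(5) = bytes
private static final Pattern httplogPattern = Pattern.compile(
    "([^ ]*) - - \\[(.*)\\] \"([^\"]*)\" ([0-9]*) ([0-9]*).*");

// Parses timestamps such as 01/Jul/1995:00:00:01 -0400
private static final SimpleDateFormat dateFormatter =
    new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);

// The count emitted with every hour
private static final IntWritable one = new IntWritable(1);

With these in place, matcher.group(2) in the map() method yields the timestamp inside the square brackets, which dateFormatter converts to a Date.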
Then, Hadoop collects all key-value pairs, sorts them, and then invokes the reducer once for each key. Each reducer walks through the values and calculates the count of page accesses for each hour.
public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  context.write(key, new IntWritable(sum));
}
The main() method of the job looks similar to that of the WordCount example described in the earlier recipe.
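For reference, here is a minimal sketch of what such a driver could look like, assuming the standard new-API job setup and the usual org.apache.hadoop.* imports; the mapper and reducer class names (AMapper and AReducer) are placeholders for the classes shown above, and the recipe's actual main() may differ in its details:

public static void main(String[] args) throws Exception {
  // Illustrative driver only; class names other than WeblogTimeOfDayHistogramCreator are assumed.
  Configuration conf = new Configuration();
  Job job = new Job(conf, "weblog-time-of-day-histogram");
  job.setJarByClass(WeblogTimeOfDayHistogramCreator.class);
  job.setMapperClass(AMapper.class);          // the mapper shown above (name assumed)
  job.setReducerClass(AReducer.class);        // the reducer shown above (name assumed)
  job.setCombinerClass(AReducer.class);       // summing is associative, so the reducer can double as a combiner
  job.setOutputKeyClass(IntWritable.class);   // hour of the day (0-23)
  job.setOutputValueClass(IntWritable.class); // number of hits in that hour
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}

The two arguments correspond to the /data/input1 and /data/output4 paths passed on the command line in the steps above.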