Aggregate values (for example, mean, max, min, standard deviation, and so on) provide the basic analytics about a dataset. You may perform these calculations either for the whole dataset or for a part of it.
In this recipe, we will use Hadoop to calculate the minimum, maximum, and average size of the files downloaded from the NASA servers, by processing the NASA weblog dataset. The following figure shows a summary of the execution:

As shown in the figure, the map task emits the size of each message under the key msgSize, and all of those values are sent to a single reducer. The reducer then walks through all of the values and calculates the aggregate values.
We will use HADOOP_HOME to refer to the Hadoop installation folder. The following steps describe how to use MapReduce to calculate simple analytics about the weblog dataset:

1. Download the NASA weblog dataset and extract it to a directory. We will call that directory DATA_DIR.
2. Upload the dataset to HDFS by running the following commands from HADOOP_HOME. If /data is already there, clean it up:
   >bin/hadoop dfs -mkdir /data
   >bin/hadoop dfs -mkdir /data/input1
   >bin/hadoop dfs -put <DATA_DIR>/NASA_access_log_Jul95 /data/input1
3. Unzip the source code for this chapter (chapter6.zip). We will call that folder CHAPTER_6_SRC.
4. Change the hadoop.home property in the CHAPTER_6_SRC/build.xml file to point to your Hadoop installation folder.
5. Compile the source by running the ant build command from the CHAPTER_6_SRC folder.
6. Copy build/lib/hadoop-cookbook-chapter6.jar to your HADOOP_HOME.
7. Run the MapReduce job through the following command from HADOOP_HOME:
   >bin/hadoop jar hadoop-cookbook-chapter6.jar chapter6.WebLogMessageSizeAggregator /data/input1 /data/output1
8. Read the results by running the following command:
   $bin/hadoop dfs -cat /data/output1/*
It will print the results as follows:

Mean 1150
Max 6823936
Min 0
You can find the source for this recipe in src/chapter6/WebLogMessageSizeAggregator.java.
HTTP logs follow a standard pattern, where each log line looks like the following. Here, the last token is the size of the web page retrieved:
205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985
We will use Java's regular expression support to parse the log lines; a Pattern.compile() call at the top of the class defines the regular expression. Since most Hadoop jobs involve text processing, regular expressions are a very useful tool when writing Hadoop jobs:
public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  Matcher matcher = httplogPattern.matcher(value.toString());
  if (matcher.matches()) {
    int size = Integer.parseInt(matcher.group(5));
    // Emit the parsed message size, so the reducer can aggregate the sizes
    context.write(new Text("msgSize"), new IntWritable(size));
  }
}
The map task receives each line in the log file as a separate key-value pair. It parses the line using the regular expression and emits the file size under the key msgSize.
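The log-parsing step can be tried outside Hadoop. The following standalone sketch uses a plausible pattern for these log lines (the pattern and class name here are assumptions for illustration; the recipe's source defines the exact expression), where group 5 captures the size token:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HttpLogParseSketch {
    // Assumed pattern: host - - [timestamp] "METHOD /path HTTP/x" status size
    private static final Pattern httplogPattern = Pattern.compile(
        "([^\\s]+) - - \\[(.+)\\] \"([^\\s]+) (/[^\\s]*) HTTP/[^\\s]+\" [^\\s]+ ([0-9]+)");

    public static void main(String[] args) {
        String line = "205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "
                + "\"GET /shuttle/countdown/countdown.html HTTP/1.0\" 200 3985";
        Matcher matcher = httplogPattern.matcher(line);
        if (matcher.matches()) {
            // group(5) is the last token of the line: the response size in bytes
            System.out.println("size=" + matcher.group(5));
        }
    }
}
```

Running this prints size=3985, matching the sample line shown earlier.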
Then, Hadoop collects all values for that key and invokes the reducer. The reducer walks through the values and calculates the minimum, maximum, and mean size of the files downloaded from the web server. It is worth noting that, by exposing the values as an iterator, Hadoop gives the programmer a chance to process the data without storing it all in memory; you should take advantage of this whenever possible.
public static class AReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    double tot = 0;
    int count = 0;
    int min = Integer.MAX_VALUE;
    int max = 0;
    Iterator<IntWritable> iterator = values.iterator();
    while (iterator.hasNext()) {
      int value = iterator.next().get();
      tot = tot + value;
      count++;
      if (value < min) {
        min = value;
      }
      if (value > max) {
        max = value;
      }
    }
    // Cast the quotient, not tot alone: (int) tot / count would overflow for
    // large totals before the division takes place
    context.write(new Text("Mean"), new IntWritable((int) (tot / count)));
    context.write(new Text("Max"), new IntWritable(max));
    context.write(new Text("Min"), new IntWritable(min));
  }
}
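The single-pass, running-aggregate pattern inside the reduce() method can be exercised in plain Java; the class name and sample sizes below are made up for illustration:

```java
import java.util.Arrays;
import java.util.Iterator;

public class RunningAggregatesSketch {
    public static void main(String[] args) {
        // Hypothetical message sizes; in the job these arrive from the values iterator
        Iterator<Integer> values = Arrays.asList(3985, 0, 6823936, 512).iterator();
        long tot = 0;
        int count = 0;
        int min = Integer.MAX_VALUE;
        int max = 0;
        while (values.hasNext()) {
            int value = values.next();
            tot += value;   // single pass: no need to keep all values in memory
            count++;
            if (value < min) { min = value; }
            if (value > max) { max = value; }
        }
        System.out.println("Mean " + (tot / count));
        System.out.println("Max " + max);
        System.out.println("Min " + min);
    }
}
```

For these four sample values it prints Mean 1707108, Max 6823936, and Min 0; only four scalars (tot, count, min, max) are kept, regardless of how many values the iterator yields.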
The main()
method of the job looks similar to that of the WordCount example, except for a few lines that have been changed to accommodate the input and output datatype changes:
Job job = new Job(conf, "LogProcessingMessageSizeAggregation");
job.setJarByClass(WebLogMessageSizeAggregator.class);
job.setMapperClass(AMapper.class);
job.setReducerClass(AReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
You can learn more about Java regular expressions from the Java tutorial, http://docs.oracle.com/javase/tutorial/essential/regex/.