MapReduce is an excellent tool for transforming data into tab-separated values (TSV). Once the input data is loaded into HDFS, the entire Hadoop cluster can be used to transform large datasets in parallel. This recipe demonstrates how to extract records from Apache access logs and store them as tab-separated values in HDFS.
You will need to download the apache_clf.txt dataset from the support page of the Packt website, http://www.packtpub.com/support, and place the file in HDFS.
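If apache_clf.txt has been downloaded to your local working directory, it can be copied into HDFS with the following command (the destination path matches the one used by the job later in this recipe):

$ hadoop fs -put apache_clf.txt /user/hadoop/apache_clf.txt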
Perform the following steps to transform Apache logs to TSV format using MapReduce:
The heart of the mapper is a regular expression that parses each record in the Apache Common Log Format. Note that the backslashes and the embedded quotes must be escaped in the Java string literal:

private Pattern p = Pattern.compile(
        "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "
        + "\"(\\w+) (.+?) (.+?)\" (\\d+) (\\d+) \"([^\"]+|(.+?))\" "
        + "\"([^\"]+|(.+?))\"", Pattern.DOTALL);
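To see which capture group holds which field, you can exercise the pattern against a made-up log entry; the standalone class and the sample line below are illustrative only, not part of the recipe:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: shows which capture group holds which CLF field.
public class ClfRegexDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(
                "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "
                + "\"(\\w+) (.+?) (.+?)\" (\\d+) (\\d+) \"([^\"]+|(.+?))\" "
                + "\"([^\"]+|(.+?))\"", Pattern.DOTALL);
        // Hypothetical log entry in Apache combined log format.
        String entry = "127.0.0.1 - - [10/Oct/2012:21:32:05 -0400] "
                + "\"GET /index.html HTTP/1.1\" 200 2326 \"-\" "
                + "\"Mozilla/5.0 (compatible)\"";
        Matcher m = p.matcher(entry);
        if (m.matches()) {
            System.out.println("ip:         " + m.group(1));
            System.out.println("date:       " + m.group(4));
            System.out.println("page:       " + m.group(6));
            System.out.println("status:     " + m.group(8));
            System.out.println("bytes:      " + m.group(9));
            System.out.println("user agent: " + m.group(12));
        }
    }
}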
Create a class named CLFMapper.java. The mapper parses each log record, converts the request date into a Unix timestamp, and emits the IP address as the key and the remaining fields, joined by tab characters, as the value:

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CLFMapper extends Mapper<Object, Text, Text, Text> {

    private SimpleDateFormat dateFormatter =
            new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z");
    private Pattern p = Pattern.compile(
            "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "
            + "\"(\\w+) (.+?) (.+?)\" (\\d+) (\\d+) \"([^\"]+|(.+?))\" "
            + "\"([^\"]+|(.+?))\"", Pattern.DOTALL);
    private Text outputKey = new Text();
    private Text outputValue = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String entry = value.toString();
        Matcher m = p.matcher(entry);
        if (!m.matches()) {
            return; // skip lines that are not valid CLF records
        }
        Date date = null;
        try {
            date = dateFormatter.parse(m.group(4));
        } catch (ParseException ex) {
            return; // skip records with unparseable dates
        }
        outputKey.set(m.group(1));    // IP address
        StringBuilder b = new StringBuilder();
        b.append(date.getTime());     // timestamp
        b.append('\t');
        b.append(m.group(6));         // page
        b.append('\t');
        b.append(m.group(8));         // HTTP status
        b.append('\t');
        b.append(m.group(9));         // bytes
        b.append('\t');
        b.append(m.group(12));        // user agent
        outputValue.set(b.toString());
        context.write(outputKey, outputValue);
    }
}
Next, create a class named ParseWeblogs.java. This driver configures a map-only job (zero reduce tasks), so the mapper output is written directly to HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ParseWeblogs extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        Configuration conf = getConf();
        Job weblogJob = new Job(conf);
        weblogJob.setJobName("Weblog Transformer");
        weblogJob.setJarByClass(getClass());
        weblogJob.setNumReduceTasks(0);    // map-only job
        weblogJob.setMapperClass(CLFMapper.class);
        weblogJob.setMapOutputKeyClass(Text.class);
        weblogJob.setMapOutputValueClass(Text.class);
        weblogJob.setOutputKeyClass(Text.class);
        weblogJob.setOutputValueClass(Text.class);
        weblogJob.setInputFormatClass(TextInputFormat.class);
        weblogJob.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(weblogJob, inputPath);
        FileOutputFormat.setOutputPath(weblogJob, outputPath);

        if (weblogJob.waitForCompletion(true)) {
            return 0;
        }
        return 1;
    }

    public static void main(String[] args) throws Exception {
        int returnCode = ToolRunner.run(new ParseWeblogs(), args);
        System.exit(returnCode);
    }
}
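If you are not using a build tool, one way to compile the classes and package myjar.jar is sketched below; the source layout under com/packt/ch3/etl is an assumption based on the class name used in the next command:

$ javac -classpath $(hadoop classpath) com/packt/ch3/etl/*.java
$ jar cf myjar.jar com/packt/ch3/etl/*.class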
Finally, run the job:

$ hadoop jar myjar.jar com.packt.ch3.etl.ParseWeblogs /user/hadoop/apache_clf.txt /user/hadoop/apache_clf_tsv
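Once the job completes, you can spot-check the result; the file name below assumes the default part-file naming for a map-only job:

$ hadoop fs -cat /user/hadoop/apache_clf_tsv/part-m-00000 | head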
We first created a mapper that was responsible for extracting the desired information from the Apache weblogs and for emitting the extracted fields in a tab-separated format.
Next, we created a map-only job to transform the web server log data into a tab-separated format. The key-value pairs emitted from the mapper were stored in a file in HDFS.
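For example, the hypothetical log entry used in the regex sketch earlier would appear in the output file as a single line, with the IP key first and the value fields after it; the columns below are separated by tabs, and the timestamp is the epoch-millisecond form of the sample date:

127.0.0.1	1349919125000	/index.html	200	2326	Mozilla/5.0 (compatible)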
By default, the TextOutputFormat class uses a tab to separate the key and the value. You can change the default separator by setting the mapred.textoutputformat.separator property (on Hadoop 2 and later, the equivalent property is mapreduce.output.textoutputformat.separator). For example, to separate the IP and the timestamp with a comma (','), we could re-run the job using the following command:
$ hadoop jar myjar.jar com.packt.ch3.etl.ParseWeblogs -Dmapred.textoutputformat.separator=',' /user/hadoop/apache_clf.txt /user/hadoop/csv
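The same effect can be achieved programmatically; a minimal sketch, assuming the property is set in the driver's run() method before the Job is constructed:

// In ParseWeblogs.run(), before creating the Job:
Configuration conf = getConf();
conf.set("mapred.textoutputformat.separator", ","); // use a comma instead of a tab
Job weblogJob = new Job(conf);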
The tab-separated output from this recipe will be used in the recipes that follow.