Transforming Apache logs into TSV format using MapReduce

MapReduce is an excellent tool for transforming data into tab-separated values (TSV). Once the input data is loaded into HDFS, the entire Hadoop cluster can be used to transform large datasets in parallel. This recipe demonstrates how to extract records from Apache access logs and store them as tab-separated values in HDFS.

Getting ready

You will need to download the apache_clf.txt dataset from the support page of the Packt website, http://www.packtpub.com/support, and place the file in HDFS.
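If the file is not already in HDFS, the following command is one way to load it, assuming the dataset was downloaded to your local working directory and that /user/hadoop is your home directory in HDFS:

$ hadoop fs -put apache_clf.txt /user/hadoop/apache_clf.txt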

How to do it...

Perform the following steps to transform Apache logs to TSV format using MapReduce:

  1. Build a regular expression pattern to parse the Apache combined log format:
    private Pattern p = Pattern.compile("^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\w+) (.+?) (.+?)\" (\\d+) (\\d+) \"([^\"]+|(.+?))\" \"([^\"]+|(.+?))\"", Pattern.DOTALL);
  2. Create a mapper class to read the log files. The mapper should emit the IP address as the key, and the following as the value: the timestamp, page, HTTP status, bytes returned to the client, and the user agent of the client:
    import java.io.IOException;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    public class CLFMapper extends Mapper<Object, Text, Text, Text> {
    
        private SimpleDateFormat dateFormatter = 
                new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z");
        private Pattern p = 
                Pattern.compile("^([\\d.]+) (\\S+) (\\S+)"
                + " \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\w+) (.+?) (.+?)\" "
                + "(\\d+) (\\d+) \"([^\"]+|(.+?))\" \"([^\"]+|(.+?))\"", 
                Pattern.DOTALL);
        
        private Text outputKey = new Text();
        private Text outputValue = new Text();
        @Override
        protected void map(Object key, Text value, Context 
          context) throws IOException, InterruptedException {
            String entry = value.toString();
            Matcher m = p.matcher(entry);
            if (!m.matches()) {
                return;
            }
            Date date = null;
            try {
                date = dateFormatter.parse(m.group(4));
            } catch (ParseException ex) {
                return;
            }
            outputKey.set(m.group(1)); //ip
            StringBuilder b = new StringBuilder();
            b.append(date.getTime()); //timestamp
            b.append('\t');
            b.append(m.group(6)); //page
            b.append('\t');
            b.append(m.group(8)); //http status
            b.append('\t');
            b.append(m.group(9)); //bytes
            b.append('\t');
            b.append(m.group(12)); //useragent
            outputValue.set(b.toString());
            context.write(outputKey, outputValue);
        }
       
    }
  3. Now, create a map-only job to apply the transformation:
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    public class ParseWeblogs extends Configured implements Tool {
      
      public int run(String[] args) throws Exception {
        
        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);
        
        Configuration conf = getConf();
        Job weblogJob = new Job(conf);
        weblogJob.setJobName("Weblog Transformer");
        weblogJob.setJarByClass(getClass());
        weblogJob.setNumReduceTasks(0);
        weblogJob.setMapperClass(CLFMapper.class);        
        weblogJob.setMapOutputKeyClass(Text.class);
        weblogJob.setMapOutputValueClass(Text.class);
        weblogJob.setOutputKeyClass(Text.class);
        weblogJob.setOutputValueClass(Text.class);
        weblogJob.setInputFormatClass(TextInputFormat.class);
        weblogJob.setOutputFormatClass(TextOutputFormat.class);
        
        FileInputFormat.setInputPaths(weblogJob, inputPath);
        FileOutputFormat.setOutputPath(weblogJob, outputPath);
        
        
        if(weblogJob.waitForCompletion(true)) {
          return 0;
        }
        return 1;
      }
      
      public static void main( String[] args ) throws Exception {
        int returnCode = ToolRunner.run(new ParseWeblogs(), args);
        System.exit(returnCode);
      }
      
    }
  4. Finally, launch the MapReduce job:
    $ hadoop jar myjar.jar com.packt.ch3.etl.ParseWeblogs /user/hadoop/apache_clf.txt /user/hadoop/apache_clf_tsv
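
Once the job completes, you can spot-check the TSV records directly from HDFS. The following command is only a convenience check and assumes the output path used above; the exact part file name (for example, part-m-00000 for a map-only job) may differ in your environment:

$ hadoop fs -cat /user/hadoop/apache_clf_tsv/part-m-00000 | head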

How it works...

We first created a mapper that was responsible for extracting the desired information from the Apache weblogs and for emitting the extracted fields in a tab-separated format.
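To see how the regular expression lines up with the fields the mapper emits, the following standalone sketch (not part of the recipe's code; the class name and the log entry are made-up examples) applies the same pattern to a single combined-log-format line and prints the capture groups used by the mapper:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class CLFPatternDemo {
        public static void main(String[] args) {
            // Same pattern as the CLFMapper class shown earlier.
            Pattern p = Pattern.compile("^([\\d.]+) (\\S+) (\\S+)"
                    + " \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\w+) (.+?) (.+?)\" "
                    + "(\\d+) (\\d+) \"([^\"]+|(.+?))\" \"([^\"]+|(.+?))\"",
                    Pattern.DOTALL);
            // A made-up entry in the Apache combined log format.
            String entry = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                    + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326 "
                    + "\"http://www.example.com/start.html\" "
                    + "\"Mozilla/4.08 [en] (Win98; I ;Nav)\"";
            Matcher m = p.matcher(entry);
            if (m.matches()) {
                System.out.println("ip         = " + m.group(1));  // 127.0.0.1
                System.out.println("timestamp  = " + m.group(4));  // 10/Oct/2000:13:55:36 -0700
                System.out.println("page       = " + m.group(6));  // /apache_pb.gif
                System.out.println("status     = " + m.group(8));  // 200
                System.out.println("bytes      = " + m.group(9));  // 2326
                System.out.println("user agent = " + m.group(12)); // Mozilla/4.08 [en] (Win98; I ;Nav)
            }
        }
    }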

Next, we created a map-only job to transform the web server log data into a tab-separated format. The key-value pairs emitted from the mapper were stored in a file in HDFS.

There's more...

By default, the TextOutputFormat class uses a tab character to separate keys and values. You can change the default separator by setting the mapred.textoutputformat.separator property. For example, to separate the IP and the timestamp with a ',', we could re-run the job using the following command:

$ hadoop jar myjar.jar com.packt.ch3.etl.ParseWeblogs -Dmapred.textoutputformat.separator=',' /user/hadoop/apache_clf.txt /user/hadoop/csv
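Alternatively, the separator can be hard-coded in the driver. The following sketch shows one possible modification to the run() method from step 3, setting the property on the Configuration object before the Job is constructed; note that newer Hadoop releases use the name mapreduce.output.textoutputformat.separator for the same setting:

    Configuration conf = getConf();
    // Write comma-separated rather than the default tab-separated output.
    conf.set("mapred.textoutputformat.separator", ",");
    Job weblogJob = new Job(conf);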

See also

The tab-separated output from this recipe will be used in the following recipes:

  • Using Apache Pig to filter bot traffic from web server logs
  • Using Apache Pig to sort web server log data by timestamp
  • Using Apache Pig to sessionize web server log data
  • Using Python to extend Apache Pig functionality
  • Using MapReduce and secondary sort to calculate page views