MapReduce is an excellent tool for transforming data into tab-separated values (TSV). Once the input data is loaded into HDFS, the entire Hadoop cluster can be used to transform large datasets in parallel. This recipe demonstrates how to extract records from Apache access logs and store them as tab-separated values in HDFS.
You will need to download the apache_clf.txt dataset from the support page of the Packt website, http://www.packtpub.com/support, and place the file in HDFS.
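If apache_clf.txt has been downloaded to your local working directory, it can be copied into HDFS with the following command (the destination path matches the one used by the job later in this recipe):

$ hadoop fs -put apache_clf.txt /user/hadoop/apache_clf.txt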
Perform the following steps to transform Apache logs to TSV format using MapReduce:
The heart of the mapper is a regular expression that parses each record in the Apache Common Log Format. Note that the backslashes and the embedded quotes must be escaped in the Java string literal:

private Pattern p = Pattern.compile(
        "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "
        + "\"(\\w+) (.+?) (.+?)\" (\\d+) (\\d+) \"([^\"]+|(.+?))\" "
        + "\"([^\"]+|(.+?))\"", Pattern.DOTALL);
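To see which capture group holds which field, you can exercise the pattern against a made-up log entry; the standalone class and the sample line below are illustrative only, not part of the recipe:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: shows which capture group holds which CLF field.
public class ClfRegexDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(
                "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "
                + "\"(\\w+) (.+?) (.+?)\" (\\d+) (\\d+) \"([^\"]+|(.+?))\" "
                + "\"([^\"]+|(.+?))\"", Pattern.DOTALL);
        // Hypothetical log entry in Apache combined log format.
        String entry = "127.0.0.1 - - [10/Oct/2012:21:32:05 -0400] "
                + "\"GET /index.html HTTP/1.1\" 200 2326 \"-\" "
                + "\"Mozilla/5.0 (compatible)\"";
        Matcher m = p.matcher(entry);
        if (m.matches()) {
            System.out.println("ip:         " + m.group(1));
            System.out.println("date:       " + m.group(4));
            System.out.println("page:       " + m.group(6));
            System.out.println("status:     " + m.group(8));
            System.out.println("bytes:      " + m.group(9));
            System.out.println("user agent: " + m.group(12));
        }
    }
}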
Create a class named CLFMapper.java. The mapper parses each log record, converts the request date into a Unix timestamp, and emits the IP address as the key and the remaining fields, joined by tab characters, as the value:

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CLFMapper extends Mapper<Object, Text, Text, Text> {

    private SimpleDateFormat dateFormatter =
            new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z");
    private Pattern p = Pattern.compile(
            "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "
            + "\"(\\w+) (.+?) (.+?)\" (\\d+) (\\d+) \"([^\"]+|(.+?))\" "
            + "\"([^\"]+|(.+?))\"", Pattern.DOTALL);
    private Text outputKey = new Text();
    private Text outputValue = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String entry = value.toString();
        Matcher m = p.matcher(entry);
        if (!m.matches()) {
            return; // skip lines that are not valid CLF records
        }
        Date date = null;
        try {
            date = dateFormatter.parse(m.group(4));
        } catch (ParseException ex) {
            return; // skip records with unparseable dates
        }
        outputKey.set(m.group(1));    // IP address
        StringBuilder b = new StringBuilder();
        b.append(date.getTime());     // timestamp
        b.append('\t');
        b.append(m.group(6));         // page
        b.append('\t');
        b.append(m.group(8));         // HTTP status
        b.append('\t');
        b.append(m.group(9));         // bytes
        b.append('\t');
        b.append(m.group(12));        // user agent
        outputValue.set(b.toString());
        context.write(outputKey, outputValue);
    }
}
Next, create a class named ParseWeblogs.java. This driver configures a map-only job (zero reduce tasks), so the mapper output is written directly to HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ParseWeblogs extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        Configuration conf = getConf();
        Job weblogJob = new Job(conf);
        weblogJob.setJobName("Weblog Transformer");
        weblogJob.setJarByClass(getClass());
        weblogJob.setNumReduceTasks(0);    // map-only job
        weblogJob.setMapperClass(CLFMapper.class);
        weblogJob.setMapOutputKeyClass(Text.class);
        weblogJob.setMapOutputValueClass(Text.class);
        weblogJob.setOutputKeyClass(Text.class);
        weblogJob.setOutputValueClass(Text.class);
        weblogJob.setInputFormatClass(TextInputFormat.class);
        weblogJob.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(weblogJob, inputPath);
        FileOutputFormat.setOutputPath(weblogJob, outputPath);

        if (weblogJob.waitForCompletion(true)) {
            return 0;
        }
        return 1;
    }

    public static void main(String[] args) throws Exception {
        int returnCode = ToolRunner.run(new ParseWeblogs(), args);
        System.exit(returnCode);
    }
}
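If you are not using a build tool, one way to compile the classes and package myjar.jar is sketched below; the source layout under com/packt/ch3/etl is an assumption based on the class name used in the next command:

$ javac -classpath $(hadoop classpath) com/packt/ch3/etl/*.java
$ jar cf myjar.jar com/packt/ch3/etl/*.class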
Finally, run the job:

$ hadoop jar myjar.jar com.packt.ch3.etl.ParseWeblogs /user/hadoop/apache_clf.txt /user/hadoop/apache_clf_tsv
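Once the job completes, you can spot-check the result; the file name below assumes the default part-file naming for a map-only job:

$ hadoop fs -cat /user/hadoop/apache_clf_tsv/part-m-00000 | head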
We first created a mapper that was responsible for extracting the desired information from the Apache weblogs and for emitting the extracted fields in a tab-separated format.
Next, we created a map-only job to transform the web server log data into a tab-separated format. The key-value pairs emitted from the mapper were stored in a file in HDFS.
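For example, the hypothetical log entry used in the regex sketch earlier would appear in the output file as a single line, with the IP key first and the value fields after it; the columns below are separated by tabs, and the timestamp is the epoch-millisecond form of the sample date:

127.0.0.1	1349919125000	/index.html	200	2326	Mozilla/5.0 (compatible)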
By default, the TextOutputFormat class uses a tab to separate the key and the value. You can change the default separator by setting the mapred.textoutputformat.separator property (on Hadoop 2 and later, the equivalent property is mapreduce.output.textoutputformat.separator). For example, to separate the IP and the timestamp with a comma (','), we could re-run the job using the following command:
$ hadoop jar myjar.jar com.packt.ch3.etl.ParseWeblogs -Dmapred.textoutputformat.separator=',' /user/hadoop/apache_clf.txt /user/hadoop/csv
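The same effect can be achieved programmatically; a minimal sketch, assuming the property is set in the driver's run() method before the Job is constructed:

// In ParseWeblogs.run(), before creating the Job:
Configuration conf = getConf();
conf.set("mapred.textoutputformat.separator", ","); // use a comma instead of a tab
Job weblogJob = new Job(conf);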
The tab-separated output from this recipe will be used in the recipes that follow.