Implementing a custom Hadoop Writable data type

There can be use cases where none of the built-in data types matches your requirements, or where a custom data type optimized for your use case performs better than a Hadoop built-in data type. In such scenarios, you can easily write a custom Writable data type by implementing the org.apache.hadoop.io.Writable interface to define the serialization format of your data type. Writable interface-based types can be used as value types in Hadoop MapReduce computations.

In this recipe, we implement a sample Hadoop Writable data type for HTTP server log entries. For the purpose of this sample, we assume that a log entry consists of five fields: request host, timestamp, request URL, response size, and HTTP status code. The following is a sample log entry:

192.168.0.2 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245

You can download a sample HTTP server log data set from ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz.

How to do it...

The following are the steps to implement a custom Hadoop Writable data type for the HTTP server log entries:

  1. Write a new LogWritable class implementing the org.apache.hadoop.io.Writable interface.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    
    public class LogWritable implements Writable {
    
      private Text userIP, timestamp, request;
      private IntWritable responseSize, status;
    
      public LogWritable() {
        this.userIP = new Text();
        this.timestamp = new Text();
        this.request = new Text();
        this.responseSize = new IntWritable();
        this.status = new IntWritable();
      }
    
      public void readFields(DataInput in) throws IOException {
        userIP.readFields(in);
        timestamp.readFields(in);
        request.readFields(in);
        responseSize.readFields(in);
        status.readFields(in);
      }
    
      public void write(DataOutput out) throws IOException {
        userIP.write(out);
        timestamp.write(out);
        request.write(out);
        responseSize.write(out);
        status.write(out);
      }
    
      // ... getters and setters for the fields
    }
  2. Use the new LogWritable type as a value type in your MapReduce computation. In the following example, we use the LogWritable type as the Map output value type (a possible map() implementation is sketched after these steps).
    public class LogProcessorMap extends Mapper<LongWritable,
        Text, Text, LogWritable> {
      // ...
    }
    
    public class LogProcessorReduce extends Reducer<Text,
        LogWritable, Text, IntWritable> {
    
      public void reduce(Text key,
          Iterable<LogWritable> values, Context context)
          throws IOException, InterruptedException {
        // ...
      }
    }
  3. Configure the output types of the job accordingly.
    Job job = new Job(..);
    // ...
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LogWritable.class);
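
The map() method itself is not shown in the preceding listings, so the following is a minimal sketch of how LogProcessorMap might parse a log line and emit a LogWritable value. The regular expression, the setter names (setUserIP(), setTimestamp(), and so on), and the use of the request host as the output key are illustrative assumptions, not part of the recipe's original code; the sketch also needs java.util.regex.Pattern and java.util.regex.Matcher imports.

    // A minimal sketch, assuming LogWritable exposes setters for its fields
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]+)\" (\\d{3}) (\\S+)");
    
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      Matcher matcher = LOG_PATTERN.matcher(value.toString());
      if (!matcher.matches()) {
        return; // skip malformed log entries
      }
      LogWritable log = new LogWritable();
      log.setUserIP(new Text(matcher.group(1)));    // request host
      log.setTimestamp(new Text(matcher.group(2))); // timestamp
      log.setRequest(new Text(matcher.group(3)));   // request URL
      log.setStatus(new IntWritable(Integer.parseInt(matcher.group(4))));
      // some servers log "-" instead of a numeric response size
      int size = "-".equals(matcher.group(5)) ? 0
          : Integer.parseInt(matcher.group(5));
      log.setResponseSize(new IntWritable(size));
      context.write(new Text(matcher.group(1)), log);
    }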

How it works...

The Writable interface consists of two methods, readFields() and write(). Inside the readFields() method, we de-serialize the input data and populate the fields of the Writable object.

  public void readFields(DataInput in) throws IOException {
    userIP.readFields(in);
    timestamp.readFields(in);
    request.readFields(in);
    responseSize.readFields(in);
    status.readFields(in);
  }

In the preceding example, we use Writable types as the fields of our custom Writable type and call the readFields() method of those fields to de-serialize the data from the DataInput object. It is also possible to use Java primitive data types as the fields of the Writable type and to use the corresponding read methods of the DataInput object to read the values from the underlying stream, as shown in the following code snippet:

int responseSize = in.readInt();
String userIP = in.readUTF();

Inside the write() method, we write the fields of the Writable object to the underlying stream.

  public void write(DataOutput out) throws IOException {
    userIP.write(out);
    timestamp.write(out);
    request.write(out);
    responseSize.write(out);
    status.write(out);
  }

If you are using Java primitive data types as the fields of the Writable object, you can use the corresponding write methods of the DataOutput object to write the values to the underlying stream, as shown in the following code snippet. Note that write() must write the fields in exactly the same order in which readFields() reads them back.

out.writeInt(responseSize);
out.writeUTF(userIP);
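
Putting the two halves together, the following is a minimal sketch of a primitive-field variant; PrimitiveLogWritable is a hypothetical name used only for illustration, and only two of the five fields are shown:

  public class PrimitiveLogWritable implements Writable {
  
    private String userIP;
    private int responseSize;
  
    public void readFields(DataInput in) throws IOException {
      // reads occur in the same order as the writes below
      userIP = in.readUTF();
      responseSize = in.readInt();
    }
  
    public void write(DataOutput out) throws IOException {
      out.writeUTF(userIP);
      out.writeInt(responseSize);
    }
  }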

There's more...

Be cautious about the following issues when implementing your custom Writable data type:

  • In case you are adding a custom constructor to your custom Writable class, make sure to retain the default empty constructor. Hadoop instantiates Writable objects using reflection, which requires a no-argument constructor.
  • TextOutputFormat uses the toString() method to serialize the key and value types. In case you are using TextOutputFormat to serialize instances of your custom Writable type, make sure to have a meaningful toString() implementation for your custom Writable data type (see the sketch after this list).
  • While reading the input data, Hadoop may reuse an instance of the Writable class repeatedly. You should not rely on the existing state of the object when populating it inside the readFields() method; readFields() should completely overwrite the object's state.
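
As an example for the second point, the following is a minimal toString() sketch for LogWritable that emits tab-separated fields; the exact output format is an illustrative choice, not part of the original recipe:

  @Override
  public String toString() {
    // TextOutputFormat writes this string as the textual form of the record
    return userIP + "\t" + timestamp + "\t" + request
        + "\t" + responseSize + "\t" + status;
  }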

See also

  • Implementing a custom Hadoop key type