There can be use cases where none of the built-in data types matches your requirements or a custom data type optimized for your use case may perform better than a Hadoop built-in data type. In such scenarios, we can easily write a custom Writable
data type by implementing the org.apache.hadoop.io.Writable
interface to define the serialization format of your data type. The Writable
interface-based types can be used as value
types in Hadoop MapReduce computations.
In this recipe, we implement a sample Hadoop Writable
data type for HTTP server log entries. For the purpose of this sample, we consider that a log entry consists of the five fields—request host, timestamp, request URL, response size, and the http status code. The following is a sample log entry:
192.168.0.2 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
You can download a sample HTTP server log data set from ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz.
The following are the steps to implement a custom Hadoop Writable data type for the HTTP server log entries:
LogWritable
class implementing the org.apache.hadoop.io.Writable
interface.public class LogWritable implements Writable{ private Text userIP, timestamp, request; privateIntWritableresponseSize, status; public LogWritable() { this.userIP = new Text(); this.timestamp= new Text(); this.request = new Text(); this.responseSize = new IntWritable(); this.status = new IntWritable(); } public void readFields(DataInput in) throws IOException { userIP.readFields(in); timestamp.readFields(in); request.readFields(in); responseSize.readFields(in); status.readFields(in); } public void write(DataOutput out) throws IOException { userIP.write(out); timestamp.write(out); request.write(out); responseSize.write(out); status.write(out); } ……… // getters and setters for the fields }
LogWritable
type as a value
type in your MapReduce computation. In the following example, we use the LogWritable
type as the Map output value type.public class LogProcessorMap extends Mapper<LongWritable, Text, Text, LogWritable> { …. } public class LogProcessorReduce extends Reducer<Text, LogWritable, Text, IntWritable> { public void reduce(Text key, Iterable<LogWritable> values, Context context) { …… } }
Job job = new Job(..); …. job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class);job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(LogWritable.class);
The Writable
interface consists of the two methods, readFields()
and write()
. Inside the readFields()
method, we de-serialize the input data and populate the fields of the Writable
object.
public void readFields(DataInput in) throws IOException { userIP.readFields(in); timestamp.readFields(in); request.readFields(in); responseSize.readFields(in); status.readFields(in); }
In the preceding example, we use the Writable
types as the fields of our custom Writable
type and use the readFields()
method of the fields for de-serializing the data from the DataInput
object. It is also possible to use java primitive data types as the fields of the Writable
type and to use the corresponding read methods of the DataInput
object to read the values from the underlying stream, as shown in the following code snippet:
int responseSize = in.readInt(); String userIP = in.readUTF();
Inside the write
()
method, we write the fields of the Writable
object to the underlying stream.
public void write(DataOutput out) throws IOException { userIP.write(out); timestamp.write(out); request.write(out); responseSize.write(out); status.write(out); }
In case you are using Java primitive data types as the fields of the Writable
object, you can use the corresponding write methods of the DataOutput
object to write the values to the underlying stream as below.
out.writeInt(responseSize); out.writeUTF(userIP);
Be cautious about the following issues when implementing your custom Writable
data type:
Writable
class, make sure to retain the default empty constructor.TextOutputFormat
uses the toString()
method to serialize the key
and value
types. In case you are using the TextOutputFormat
to serialize instances of your custom Writable
type, make sure to have a meaningful toString()
implementation for your custom Writable
data type.Writable
class repeatedly. You should not rely on the existing state of the object when populating it inside the readFields()
method.