Hadoop uses classes based on the Writable interface as the data types for MapReduce computations. These data types are used throughout the MapReduce computational flow, starting with reading the input data, transferring intermediate data between the Map and Reduce tasks, and finally writing the output data. Choosing the appropriate Writable data types for your input, intermediate, and output data can have a large effect on the performance and the programmability of your MapReduce programs.
To be used as a value data type of a MapReduce computation, a data type must implement the org.apache.hadoop.io.Writable interface. The Writable interface defines how Hadoop should serialize and deserialize the values when transmitting and storing the data. To be used as a key data type of a MapReduce computation, a data type must implement the org.apache.hadoop.io.WritableComparable<T> interface. In addition to the functionality of the Writable interface, the WritableComparable interface further defines how to compare keys of this type with each other for sorting purposes.
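To make the two contracts concrete, the following self-contained sketch mirrors the shape of both interfaces with local stand-ins (the real interfaces live in org.apache.hadoop.io; the local interface definitions and the IntPair key type here are illustrative assumptions, not Hadoop code):

```java
import java.io.*;

public class WritableSketch {
    // Local stand-ins mirroring the method signatures of
    // org.apache.hadoop.io.Writable and WritableComparable.
    interface Writable {
        void write(DataOutput out) throws IOException;      // serialize
        void readFields(DataInput in) throws IOException;   // deserialize
    }
    interface WritableComparable<T> extends Writable, Comparable<T> {}

    // A hypothetical key type holding two ints, sorted by the first field.
    static class IntPair implements WritableComparable<IntPair> {
        int first, second;

        public void write(DataOutput out) throws IOException {
            out.writeInt(first);
            out.writeInt(second);
        }
        public void readFields(DataInput in) throws IOException {
            first = in.readInt();
            second = in.readInt();
        }
        public int compareTo(IntPair o) {
            return first != o.first
                    ? Integer.compare(first, o.first)
                    : Integer.compare(second, o.second);
        }
    }

    // Round-trips a pair through its binary form, as Hadoop would when
    // shuffling keys between the Map and Reduce tasks.
    static IntPair roundTrip(int a, int b) {
        try {
            IntPair in = new IntPair();
            in.first = a;
            in.second = b;
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            in.write(new DataOutputStream(buf));

            IntPair out = new IntPair();
            out.readFields(new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray())));
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        IntPair p = roundTrip(3, 7);
        System.out.println(p.first + "," + p.second);  // prints 3,7
    }
}
```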
Hadoop's Writable versus Java's Serializable
Hadoop's Writable-based serialization framework provides a more efficient and customizable serialization and representation of the data for MapReduce programs than Java's native serialization framework. Unlike Java's serialization, Hadoop's Writable framework does not write the type name with each serialized object; it expects all clients of the serialized data to be aware of the types used in it. Omitting the type names makes the serialization process faster and results in compact, randomly accessible serialized data formats that can be easily interpreted by non-Java clients. Hadoop's Writable-based serialization can also reduce object-creation overhead by reusing Writable objects, which is not possible with Java's native serialization framework.
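The per-object overhead of that type metadata is easy to observe. The following sketch serializes a single int both ways, using only the JDK (a minimal illustration of the size difference, not a rigorous benchmark):

```java
import java.io.*;

public class SerializationOverhead {
    // Serializes an int with Java's native serialization, which records
    // a stream header and the class descriptor alongside the value.
    static int javaSerializedSize(int value) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
            oos.writeObject(Integer.valueOf(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    // Serializes an int the way a Writable such as IntWritable does:
    // just the raw 4-byte value, no type information.
    static int writableStyleSize(int value) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            new DataOutputStream(buf).writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    public static void main(String[] args) {
        System.out.println("Java serialization: " + javaSerializedSize(42) + " bytes");
        System.out.println("Writable-style:     " + writableStyleSize(42) + " bytes");
    }
}
```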
The following steps show you how to configure the input and output data types of your Hadoop MapReduce application:

1. Specify the data types for the input (key: LongWritable, value: Text) and output (key: Text, value: IntWritable) key-value pairs of your mapper using the generic-type variables.

```java
public class SampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) … {
        ……
    }
}
```

2. Specify the data types for the input (key: Text, value: IntWritable) and output (key: Text, value: IntWritable) key-value pairs of your reducer using the generic-type variables. The reducer's input key-value pair data types should match the mapper's output key-value pairs.

```java
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) {
        ……
    }
}
```

3. Specify the output data types of the MapReduce computation using the Job object as shown in the following code snippet. These data types will serve as the output types for both the reducer and the mapper, unless you specifically configure the mapper output types as done in step 4.

```java
Job job = new Job(..);
….
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
```

4. Optionally, configure the data types for the mapper's output key-value pairs, when they differ from the output data types of the reducer.

```java
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
```
Hadoop provides several primitive data types such as IntWritable, LongWritable, BooleanWritable, FloatWritable, and ByteWritable, which are the Writable versions of their respective Java primitive data types. We can use these types as both the key types and the value types.
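Unlike Java's immutable boxed primitives, these Writable primitives are mutable holders with get/set methods, which is what enables the object reuse mentioned earlier. The sketch below uses a local stand-in mirroring IntWritable (the class here is illustrative; Hadoop's real IntWritable lives in org.apache.hadoop.io) to read a stream of values into a single reused instance:

```java
import java.io.*;

public class PrimitiveWritableSketch {
    // A local stand-in mirroring org.apache.hadoop.io.IntWritable:
    // a mutable box around an int with Writable-style serialization.
    static class IntWritable {
        private int value;
        public void set(int value) { this.value = value; }
        public int get() { return value; }
        public void write(DataOutput out) throws IOException { out.writeInt(value); }
        public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    // Sums a serialized stream of ints by deserializing them all into
    // one reused IntWritable instance -- no per-record allocation.
    static int sum(int... values) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            IntWritable w = new IntWritable();
            for (int v : values) { w.set(v); w.write(out); }

            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray()));
            int total = 0;
            for (int i = 0; i < values.length; i++) {
                w.readFields(in);   // same object, new contents
                total += w.get();
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(1, 2, 3, 4));  // prints 10
    }
}
```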
The following are several more Hadoop built-in data types that we can use as both the key and the value types:
The following Hadoop built-in collection data types can only be used as value types:
ArrayWritable: This stores an array of values belonging to a Writable type. To use the ArrayWritable type as the value type of a reducer's input, you need to create a subclass of ArrayWritable to specify the type of the Writable values stored in it.

```java
public class LongArrayWritable extends ArrayWritable {
    public LongArrayWritable() {
        super(LongWritable.class);
    }
}
```
TwoDArrayWritable: This stores a matrix of values belonging to the same Writable type. To use the TwoDArrayWritable type as the value type of a reducer's input, you need to specify the type of the stored values by creating a subclass of the TwoDArrayWritable type, similar to the ArrayWritable type.

MapWritable: This stores a map of key-value pairs. Keys and values should be of the Writable data types.

SortedMapWritable: This stores a sorted map of key-value pairs. Keys should implement the WritableComparable interface.