Choosing appropriate Hadoop data types

Hadoop uses classes based on the Writable interface as the data types for its MapReduce computations. These data types are used throughout the MapReduce computational flow: reading the input data, transferring intermediate data between the Map and Reduce tasks, and finally writing the output data. Choosing appropriate Writable data types for your input, intermediate, and output data can have a large effect on the performance and the programmability of your MapReduce programs.

In order to be used as a value data type of a MapReduce computation, a data type must implement the org.apache.hadoop.io.Writable interface. The Writable interface defines how Hadoop serializes and de-serializes the values when transmitting and storing the data. In order to be used as a key data type of a MapReduce computation, a data type must implement the org.apache.hadoop.io.WritableComparable<T> interface. In addition to the functionality of the Writable interface, the WritableComparable interface defines how keys of this type are compared with each other for sorting purposes.
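
As a rough sketch of what these two contracts require, the following hypothetical MillisecondsWritable key type (the class name and its single long field are assumptions for illustration, not part of Hadoop) implements the serialization methods of Writable and the comparison method added by WritableComparable:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class MillisecondsWritable implements WritableComparable<MillisecondsWritable> {
      private long millis;

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeLong(millis);    // serialize the state
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        millis = in.readLong();   // de-serialize the state
      }

      @Override
      public int compareTo(MillisecondsWritable other) {
        return Long.compare(millis, other.millis);  // ordering used when sorting keys
      }
    }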

Tip

Hadoop's Writable versus Java's Serializable

Hadoop's Writable-based serialization framework provides a more efficient and customized serialization and representation of the data for MapReduce programs than Java's general-purpose native serialization framework. As opposed to Java's serialization, Hadoop's Writable framework does not write the type name with each object, expecting all clients of the serialized data to be aware of the types used in it. Omitting the type names makes the serialization process faster and results in compact, randomly accessible serialized data formats that can be easily interpreted by non-Java clients. Hadoop's Writable-based serialization can also reduce the object-creation overhead by reusing the Writable objects, which is not possible with Java's native serialization framework.
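
As a minimal sketch of that object reuse (the class and field names here are illustrative, not taken from this recipe), a mapper can allocate its output Writable objects once and refill them for every input record:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      // Created once and reused for every record, avoiding per-record object allocations
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());  // refill the same Text instance
          context.write(word, ONE);         // the framework serializes the current contents
        }
      }
    }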

How to do it...

The following steps show you how to configure the input and output data types of your Hadoop MapReduce application:

  1. Specify the data types for the input (key: LongWritable, value: Text) and output (key: Text, value: IntWritable) key-value pairs of your mapper using the generic-type variables.
    public class SampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

      public void map(LongWritable key, Text value,
          Context context) … {
        ……
      }
    }
  2. Specify the data types for the input (key: Text, value: IntWritable) and output (key: Text, value: IntWritable) key-value pairs of your reducer using the generic-type variables. The reducer's input key-value pair data types should match the mapper's output key-value pair data types.
    public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key,
          Iterable<IntWritable> values, Context context) {
        ……
      }
    }
  3. Specify the output data types of the MapReduce computation using the Job object as shown in the following code snippet. These data types will serve as the output types for both the mapper and the reducer, unless you specifically configure the mapper output types as done in step 4.
    Job job = new Job(..);
    ….  
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  4. Optionally, when your mapper and reducer output key-value pairs of different data types, you can configure the mapper's output data types separately as follows (a consolidated driver sketch follows this list).
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
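
Putting steps 3 and 4 together, a minimal driver could look like the following sketch; the job name, the input and output paths, and the reuse of the SampleMapper and Reduce classes from steps 1 and 2 are assumptions made for illustration:

    Job job = new Job(new Configuration(), "type-config-example");  // illustrative job name
    job.setJarByClass(SampleMapper.class);
    job.setMapperClass(SampleMapper.class);
    job.setReducerClass(Reduce.class);

    // Output types shared by the mapper and the reducer (step 3)
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Only needed when the mapper's output types differ from the job's output types (step 4)
    // job.setMapOutputKeyClass(Text.class);
    // job.setMapOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path("input"));    // illustrative paths
    FileOutputFormat.setOutputPath(job, new Path("output"));
    job.waitForCompletion(true);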

There's more...

Hadoop provides several primitive data types such as IntWritable, LongWritable, BooleanWritable, FloatWritable, and ByteWritable, which are the Writable versions of their respective Java primitive data types. We can use these types as both key types and value types.
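
As a quick, purely illustrative sketch, each of these wrappers exposes get() and set() methods around the underlying Java primitive:

    IntWritable count = new IntWritable(1);
    count.set(count.get() + 1);                   // wraps a Java int; now holds 2

    LongWritable offset = new LongWritable(1024L);
    BooleanWritable isValid = new BooleanWritable(true);
    FloatWritable ratio = new FloatWritable(0.75f);

    // All of the above are WritableComparable, so they work as keys as well as values
    System.out.println(count.get() + " " + offset.get() + " " + isValid.get() + " " + ratio.get());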

The following are several more Hadoop built-in data types that we can use as both key and value types:

  • Text: This stores UTF-8 text
  • BytesWritable: This stores a sequence of bytes
  • VIntWritable and VLongWritable: These store variable-length integer and long values
  • NullWritable: This is a zero-length Writable type that can be used when you don't want to use a key or value type, as shown in the sketch after this list
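
For instance, a mapper that only needs to emit distinct records can use NullWritable as its value type; the following sketch, with an assumed DistinctMapper class, is for illustration only:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DistinctMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Emit each record as the key with a zero-length NullWritable value
        context.write(value, NullWritable.get());
      }
    }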

The following Hadoop built-in collection data types can only be used as value types:

  • ArrayWritable: This stores an array of values belonging to a Writable type. To use the ArrayWritable type as the value type of a reducer's input, you need to create a subclass of ArrayWritable to specify the type of the Writable values stored in it.
    public class LongArrayWritable extends ArrayWritable {
      public LongArrayWritable() {
        super(LongWritable.class);
      }
    }
  • TwoDArrayWritable: This stores a matrix of values belonging to the same Writable type. To use the TwoDArrayWritable type as the value type of a reducer's input, you need to specify the type of the stored values by creating a subclass of TwoDArrayWritable, in the same manner as for the ArrayWritable type.
  • MapWritable: This stores a map of key-value pairs. Keys and values should be of Writable data types (see the sketch following this list).
  • SortedMapWritable: This stores a sorted map of key-value pairs. Keys should implement the WritableComparable interface.
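
As a brief sketch with assumed field names, a MapWritable can be populated and read back as follows:

    MapWritable fields = new MapWritable();
    fields.put(new Text("visits"), new IntWritable(42));                // illustrative keys and values
    fields.put(new Text("lastSeen"), new LongWritable(1391212800000L));

    IntWritable visits = (IntWritable) fields.get(new Text("visits"));
    System.out.println("visits = " + visits.get());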

See also

  • Implementing a custom Hadoop Writable data type
  • Implementing a custom Hadoop key type