Formatting the results of MapReduce computations – using Hadoop OutputFormats

The output of a MapReduce computation is often consumed by other applications. Hence, it is important to store the result in a format that the target application can consume efficiently, and to organize the data in a location that the target application can access efficiently. We can use the Hadoop OutputFormat abstraction to define the data storage format, the data storage location, and the organization of the output data of a MapReduce computation. An OutputFormat prepares the output location and provides a RecordWriter implementation to perform the actual serialization and storage of the data.

Hadoop uses org.apache.hadoop.mapreduce.lib.output.TextOutputFormat<K,V> as the default OutputFormat for MapReduce computations. TextOutputFormat writes the records of the output data to plain text files in HDFS, using a separate line for each record and a tab character to separate the key and the value of a record. TextOutputFormat extends FileOutputFormat, which is the base class for all file-based output formats.
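To make the record layout concrete, the following plain-Java sketch reproduces the line format described above without depending on Hadoop. The class TextRecordLayoutSketch and its methods are hypothetical names introduced only for illustration, and the sample data is made up; this is not the TextOutputFormat implementation itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: mimics the "key<TAB>value" line
// layout that TextOutputFormat produces for each output record.
class TextRecordLayoutSketch {

    // One record becomes one line: key, a tab character, value, newline.
    static String formatRecord(Object key, Object value) {
        return key + "\t" + value + "\n";
    }

    // Formats every entry of the map in its iteration order.
    static String formatAll(Map<String, Integer> records) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> entry : records.entrySet()) {
            out.append(formatRecord(entry.getKey(), entry.getValue()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample records, standing in for reducer output.
        Map<String, Integer> hits = new LinkedHashMap<>();
        hits.put("/index.html", 42);
        hits.put("/about.html", 7);
        System.out.print(formatAll(hits));
    }
}
```

An application that consumes such files can recover each record by splitting a line at the first tab character.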

How to do it...

The following steps show you how to use SequenceFileOutputFormat, which is based on FileOutputFormat, as the OutputFormat for a Hadoop MapReduce computation.

  1. In this example, we are going to specify the org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat<K,V> as the OutputFormat for a Hadoop MapReduce computation using the Job object as follows:
    Configuration conf = new Configuration();
    Job job = new Job(conf, "log-analysis");
    ……
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
  2. Set the output path of the job:
    FileOutputFormat.setOutputPath(job, new Path(outputPath));

How it works...

SequenceFileOutputFormat serializes the data to Hadoop Sequence files. Hadoop Sequence files store the data as binary key-value pairs and support data compression. Sequence files are especially efficient for storing non-text data. We can use Sequence files to store the result of a MapReduce computation if that output is going to be the input of another Hadoop MapReduce computation.
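The chaining scenario described above could be configured roughly as follows. This is a minimal sketch, not a complete driver: the class and method names, the job names, and the Text/IntWritable output types are assumptions made for illustration, and the mapper and reducer classes are left out. It also assumes a Hadoop release that provides Job.getInstance(); on older releases the new Job(conf, name) constructor used in the recipe plays the same role.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical sketch of two chained jobs that exchange data as Sequence files.
class SequenceFileChainSketch {

    // First job: store its results as block-compressed Sequence files.
    static Job configureProducer(Configuration conf, Path intermediate) throws Exception {
        Job job = Job.getInstance(conf, "producer");
        // Assumed output types; use whatever your reducer actually emits.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, intermediate);
        // Optional: turn on the compression support of Sequence files.
        FileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
        return job;
    }

    // Second job: read the first job's Sequence files as its input.
    static Job configureConsumer(Configuration conf, Path intermediate, Path finalOutput)
            throws Exception {
        Job job = Job.getInstance(conf, "consumer");
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, intermediate);
        // No output format is set here, so the default TextOutputFormat applies.
        FileOutputFormat.setOutputPath(job, finalOutput);
        return job;
    }
}
```

Because the second job reads the binary key-value pairs directly, its mapper receives the same key and value types that the first job wrote, with no text parsing in between.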

SequenceFileOutputFormat is based on FileOutputFormat, which is the base class for the file-based OutputFormats. Hence, we specify the output path of the MapReduce computation using the setOutputPath() method of FileOutputFormat. We have to perform this step when using any OutputFormat that is based on FileOutputFormat.

FileOutputFormat.setOutputPath(job, new Path(outputPath));

There's more...

You can implement custom OutputFormat classes to write the output of your MapReduce computations in a proprietary or custom data format, or to store the result in storage other than HDFS, by extending the org.apache.hadoop.mapreduce.OutputFormat<K,V> abstract class. If your OutputFormat implementation stores the data in a filesystem, you can extend the FileOutputFormat class to make your life easier.
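As an illustration of the file-based case, the following sketch extends FileOutputFormat and writes each record as a comma-delimited line. The class names CommaDelimitedOutputFormat and LineWriter, the ".csv" extension, and the line layout are all hypothetical choices made for this example. It performs no quoting or escaping, so treat it as a starting point under those assumptions rather than a finished format.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical custom format: one "key,value" line per record.
class CommaDelimitedOutputFormat<K, V> extends FileOutputFormat<K, V> {

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // FileOutputFormat supplies a per-task file under the job's output path.
        Path file = getDefaultWorkFile(context, ".csv");
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        // Do not overwrite: each task attempt is expected to get a fresh file.
        return new LineWriter<K, V>(fs.create(file, false));
    }

    // Performs the actual serialization of the records.
    static class LineWriter<K, V> extends RecordWriter<K, V> {
        private final DataOutputStream out;

        LineWriter(DataOutputStream out) {
            this.out = out;
        }

        @Override
        public void write(K key, V value) throws IOException {
            // Relies on toString() of key and value; no escaping of commas.
            String line = key + "," + value + "\n";
            out.write(line.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException {
            out.close();
        }
    }
}
```

Such a class would be registered in the same way as the built-in formats, by passing it to job.setOutputFormatClass() and setting the output path through FileOutputFormat.setOutputPath().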
