The output of your MapReduce computation is often consumed by other applications. Hence, it is important to store the result in a format that the target application can consume efficiently, and in a location that the target application can access efficiently. We can use the Hadoop OutputFormat interface to define the data storage format, the data storage location, and the organization of the output data of a MapReduce computation. An OutputFormat prepares the output location and provides a RecordWriter implementation to perform the actual serialization and storage of the data.
Hadoop uses org.apache.hadoop.mapreduce.lib.output.TextOutputFormat<K,V> as the default OutputFormat for MapReduce computations. TextOutputFormat writes the records of the output data to plain text files in HDFS, using a separate line for each record and a tab character to separate the key from the value. TextOutputFormat extends FileOutputFormat, which is the base class for all file-based output formats.
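To make the line layout concrete, here is a minimal sketch of how a consuming application might parse one line written by TextOutputFormat. The class name TextOutputLineParser and the sample record are hypothetical and used only for illustration. The sketch assumes the default tab separator and splits at the first tab only, so that a value may itself contain tabs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper for a downstream application; not part of Hadoop.
public class TextOutputLineParser {

    // Splits one output line into key and value at the FIRST tab character.
    public static Map.Entry<String, String> parseLine(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // Assumption: a line without a separator is treated as a key
            // with an empty value.
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        // Sample record made up for this example.
        Map.Entry<String, String> record = parseLine("/index.html\t42");
        System.out.println(record.getKey() + " -> " + record.getValue());
        // prints "/index.html -> 42"
    }
}
```

Every consumer of plain text output needs parsing of this kind, which is one reason a binary format can be preferable when the consumer is another Hadoop job.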
The following steps show you how to use the FileOutputFormat-based SequenceFileOutputFormat as the OutputFormat for a Hadoop MapReduce computation.
Specify org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat<K,V> as the OutputFormat for a Hadoop MapReduce computation using the Job object as follows:
    Configuration conf = new Configuration();
    Job job = new Job(conf, "log-analysis");
    ……
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
Set the output path of the job:
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
SequenceFileOutputFormat serializes the data to Hadoop Sequence files. Hadoop Sequence files store the data as binary key-value pairs and support data compression. Sequence files are especially efficient for storing non-text data. We can use Sequence files to store the result of a MapReduce computation if that output is going to be the input of another Hadoop MapReduce computation.
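The chaining scenario described above could be sketched as follows. This is an illustrative outline, not a complete driver: the class name SequenceFileChain, the second job name, and the Text/IntWritable output types are assumptions, and mapper and reducer setup is omitted. It assumes Hadoop 2 or later, where Job.getInstance() is available; on older releases the Job constructor plays the same role. The first job writes block-compressed Sequence files, and the second job reads them back with SequenceFileInputFormat.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical driver helper; mapper and reducer classes are left out.
public class SequenceFileChain {

    // First job: writes its result as block-compressed Sequence files.
    public static Job firstJob(Configuration conf, Path intermediate) throws IOException {
        Job job = Job.getInstance(conf, "log-analysis");
        job.setOutputKeyClass(Text.class);          // assumed key type
        job.setOutputValueClass(IntWritable.class); // assumed value type
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, intermediate);
        // Enable compression and compress blocks of records together.
        FileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
        return job;
    }

    // Second job: consumes the Sequence files produced by the first job.
    public static Job secondJob(Configuration conf, Path intermediate) throws IOException {
        Job job = Job.getInstance(conf, "log-analysis-step2");
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, intermediate);
        return job;
    }
}
```

Because the keys and values stay in their binary Writable form between the two jobs, the second job receives typed records directly and does not have to parse text.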
SequenceFileOutputFormat is based on FileOutputFormat, which is the base class for file-based OutputFormat implementations. Hence, we specify the output path of the MapReduce computation using the setOutputPath() method of FileOutputFormat. We have to perform this step when using any OutputFormat that is based on FileOutputFormat.
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
You can implement custom OutputFormat classes to write the output of your MapReduce computations in a proprietary or custom data format, or to store the result in storage other than HDFS, by extending the org.apache.hadoop.mapreduce.OutputFormat<K,V> abstract class. If your OutputFormat implementation stores the data in a filesystem, you can extend the FileOutputFormat class instead, which already provides the output path validation and the output committer for you.
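As a rough illustration of the FileOutputFormat route, the following sketch writes each record as a comma-separated line. The class name CsvOutputFormat, the Text/IntWritable record types, and the formatRecord() helper are hypothetical, and the sketch does no escaping of commas or quotes, so treat it as a starting point under those assumptions and not as a complete CSV writer.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical custom format: one "key,value" line per record.
public class CsvOutputFormat extends FileOutputFormat<Text, IntWritable> {

    // Builds the text of a single record. No escaping is performed (assumption:
    // keys contain no commas or newlines).
    public static String formatRecord(String key, int value) {
        return key + "," + value + "\n";
    }

    @Override
    public RecordWriter<Text, IntWritable> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // FileOutputFormat picks a per-task file inside the job's output path.
        Path file = getDefaultWorkFile(context, ".csv");
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        final DataOutputStream out = fs.create(file, false);

        return new RecordWriter<Text, IntWritable>() {
            @Override
            public void write(Text key, IntWritable value) throws IOException {
                String line = formatRecord(key.toString(), value.get());
                out.write(line.getBytes(StandardCharsets.UTF_8));
            }

            @Override
            public void close(TaskAttemptContext ctx) throws IOException {
                out.close();
            }
        };
    }
}
```

A job would select this format with job.setOutputFormatClass(CsvOutputFormat.class) and still set the output location through FileOutputFormat.setOutputPath(), since the class inherits the path handling from FileOutputFormat.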