Hadoop supports a number of compression algorithms, including bzip2, gzip, and DEFLATE.
Hadoop provides Java implementations of these algorithms, and therefore, files can be easily compressed/decompressed using the FileSystem
API or MapReduce input and output formats.
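As a rough illustration of the FileSystem API route, the following is a minimal sketch of compressing a file in HDFS with one of the built-in codecs (gzip here). The paths are placeholders, and the class requires the Hadoop libraries on the classpath; it is not part of this recipe's steps.

```java
// Sketch: compress an HDFS file with a built-in codec via the FileSystem API.
// Assumes Hadoop is on the classpath and the input path exists in HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.InputStream;
import java.io.OutputStream;

public class CompressInHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec =
                ReflectionUtils.newInstance(GzipCodec.class, conf);
        Path in = new Path("/test/weblog_entries.txt");          // placeholder
        // Append the codec's default extension (".gz" for gzip).
        Path out = new Path(in + codec.getDefaultExtension());
        try (InputStream is = fs.open(in);
             OutputStream os = codec.createOutputStream(fs.create(out))) {
            IOUtils.copyBytes(is, os, conf);                     // stream-copy
        }
    }
}
```

Reading a compressed file back is symmetric: wrap the input stream with codec.createInputStream() instead.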
However, there is a drawback to storing data in HDFS using the compression formats listed previously: they are not splittable. Once a file is compressed using any of these codecs, it can only be decompressed by reading it from the beginning; it cannot be processed in independent pieces.
To understand why this is a drawback, you must first understand how Hadoop MapReduce determines the number of mappers to launch for a given job. The number of mappers launched is roughly equal to the input size divided by dfs.block.size
(the default block size is 64 MB). The blocks of work that each mapper will receive are called input splits. For example, if the input to a MapReduce job was an uncompressed file that was 128 MB, this would probably result in two mappers being launched (128 MB/64 MB).
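This rough rule can be sketched as a ceiling division; note that the real FileInputFormat logic also honors minimum/maximum split-size settings, so this is only the back-of-the-envelope estimate described above.

```java
// Rough estimate of mapper count: input size divided by block size,
// rounded up so a trailing partial block still gets its own mapper.
public class SplitEstimate {
    static long estimateMappers(long inputBytes, long blockBytes) {
        // Ceiling division without floating point.
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 128 MB input over 64 MB blocks -> 2 mappers, as in the text.
        System.out.println(estimateMappers(128 * mb, 64 * mb)); // 2
        // A 130 MB input would spill into a third block -> 3 mappers.
        System.out.println(estimateMappers(130 * mb, 64 * mb)); // 3
    }
}
```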
Since files compressed using the bzip2, gzip, and DEFLATE codecs cannot be split, the whole file must be given as a single input split to the mapper. Using the previous example, if the input to a MapReduce job was a gzip compressed file that was 128 MB, the MapReduce framework would only launch one mapper.
Now, where does LZO fit into all of this? The LZO algorithm was designed for fast decompression while keeping compression speeds comparable to DEFLATE. In addition, thanks to the hard work of the Hadoop community, LZO-compressed files are splittable.
You will need to download the LZO codec implementation for Hadoop from https://github.com/kevinweil/hadoop-lzo.
Perform the following steps to set up LZO and then compress and index a text file:
1. Install the lzo and lzo-devel packages. On Red Hat Linux, use:

   # yum install liblzo-devel

   On Ubuntu, use:

   # apt-get install liblzo2-devel

2. Download the hadoop-lzo source and build the project:

   # cd kevinweil-hadoop-lzo-6bb1b7f/
   # export JAVA_HOME=/path/to/jdk/
   # ./setup.sh

   If the build completes successfully, you should see:

   BUILD SUCCESSFUL

3. Copy the built JAR to the lib folder on your cluster:

   # cp build/hadoop-lzo*.jar /path/to/hadoop/lib/

4. Copy the native libraries to the lib/native folder on your cluster:

   # tar -cBf - -C build/hadoop-lzo-0.4.15/lib/native/ . | tar -xBvf - -C /path/to/hadoop/lib/native

5. Update core-site.xml to use the LZO codec classes:

   <property>
     <name>io.compression.codecs</name>
     <value>org.apache.hadoop.io.compress.GzipCodec,
            org.apache.hadoop.io.compress.DefaultCodec,
            org.apache.hadoop.io.compress.BZip2Codec,
            com.hadoop.compression.lzo.LzoCodec,
            com.hadoop.compression.lzo.LzopCodec
     </value>
   </property>
   <property>
     <name>io.compression.codec.lzo.class</name>
     <value>com.hadoop.compression.lzo.LzoCodec</value>
   </property>

6. Add the LZO JAR and native library paths to the hadoop-env.sh script:

   export HADOOP_CLASSPATH=/path/to/hadoop/lib/hadoop-lzo-X.X.XX.jar
   export JAVA_LIBRARY_PATH=/path/to/hadoop/lib/native/hadoop-lzo-native-lib:/path/to/hadoop/lib/native/other-native-libs

7. Now test the installation of the LZO library by compressing the weblog_entries.txt file:

   $ lzop weblog_entries.txt

8. Put the compressed weblog_entries.txt.lzo file into HDFS:

   $ hadoop fs -put weblog_entries.txt.lzo /test/weblog_entries.txt.lzo

9. Index the weblog_entries.txt.lzo file:

   $ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer /test/weblog_entries.txt.lzo

10. You should now see two files in the /test folder:

    $ hadoop fs -ls /test
    /test/weblog_entries.txt.lzo
    /test/weblog_entries.txt.lzo.index
This recipe involved a lot of steps. After we moved the LZO JAR files and native libraries into place, we updated the io.compression.codecs property in core-site.xml. Both HDFS and Hadoop MapReduce share this configuration file, and the value of the io.compression.codecs property determines which codecs are available to the system.
Finally, we ran DistributedLzoIndexer. This is a MapReduce application that reads one or more LZO-compressed files and indexes the LZO block boundaries of each file. Once this application has been run on an LZO file, the file can be split and sent to multiple mappers by using the included input format LzoTextInputFormat.
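To make that concrete, the following is a minimal driver sketch showing where LzoTextInputFormat plugs into a job. The mapper/reducer wiring is omitted, the paths are placeholders from this recipe, and the class names follow the hadoop-lzo project; both Hadoop and the hadoop-lzo JAR must be on the classpath.

```java
// Sketch: configuring a MapReduce job to split indexed .lzo input files.
// Assumes Hadoop and hadoop-lzo are on the classpath; paths are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "lzo input example");
        job.setJarByClass(LzoInputDriver.class);
        // The key change: LzoTextInputFormat instead of TextInputFormat.
        // For files that have a companion .index file, it creates one input
        // split per indexed block range instead of one split for the whole file.
        job.setInputFormatClass(LzoTextInputFormat.class);
        // ... set mapper, reducer, and output key/value classes here ...
        FileInputFormat.addInputPath(job, new Path("/test/weblog_entries.txt.lzo"));
        FileOutputFormat.setOutputPath(job, new Path("/test/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Unindexed .lzo files still work with this input format, but each one falls back to a single split.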
In addition to DistributedLzoIndexer, the Hadoop LZO library also includes a class named LzoIndexer. LzoIndexer launches a standalone application to index LZO files in HDFS. To index the weblog_entries.txt.lzo file in HDFS, run the following command:
$ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.LzoIndexer /test/weblog_entries.txt.lzo