The Apache HBase data store is very useful for storing large-scale data in a semi-structured manner, so that it can be used for further processing using Hadoop MapReduce programs or to provide random-access data storage for client applications. In this recipe, we are going to import a large text dataset to HBase using the importtsv and bulkload tools.
1. Install and deploy Hadoop MapReduce and HDFS. Export the HADOOP_HOME environment variable to point to your Hadoop installation root folder.
2. Install and deploy Apache HBase in the distributed mode. Refer to the Deploying HBase on a Hadoop cluster recipe in this chapter for more information. Export the HBASE_HOME environment variable to point to your HBase installation root folder.
3. Install Python on your Hadoop compute nodes, if it is not already installed.
The following steps show you how to load the TSV-converted 20news dataset into an HBase table:

1. Preprocess the 20news dataset using the MailPreProcessor.py Hadoop Streaming mapper, writing the tab-separated output to the 20news-cleaned directory:

>bin/hadoop jar ../contrib/streaming/hadoop-streaming-VERSION.jar -input 20news-all -output 20news-cleaned -mapper MailPreProcessor.py -file MailPreProcessor.py
2. Go to HBASE_HOME and start the HBase Shell:

>cd $HBASE_HOME
>bin/hbase shell
3. Create a table named 20news-data by executing the following command in the HBase Shell. Older versions of the importtsv command (used in the next step) can handle only a single column family. Hence, we are using only a single column family when creating the HBase table:

hbase(main):001:0> create '20news-data','h'
4. Go to HADOOP_HOME and execute the following command to import the preprocessed data to the previously created HBase table:

>bin/hadoop jar $HBASE_HOME/hbase-<VERSION>.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,h:from,h:group,h:subj,h:msg 20news-data 20news-cleaned
5. Go to HBASE_HOME and start the HBase Shell. Use the count and scan commands of the HBase Shell to verify the contents of the table:

hbase(main):010:0> count '20news-data'
12xxx row(s) in 0.0250 seconds
hbase(main):010:0> scan '20news-data', {LIMIT => 10}
ROW                      COLUMN+CELL
 <[email protected]     column=h:c1, timestamp=1354028803355, value=[email protected] (Chris Katopis)>
 <[email protected]     column=h:c2, timestamp=1354028803355, value=sci.electronics
......
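The imported table can also be verified programmatically, which is convenient when HBase serves as the random-access data store for client applications mentioned at the start of this recipe. The following is a minimal sketch using the third-party happybase Python client, which is not part of this recipe; it assumes an HBase Thrift server is running on localhost, and the row key shown is a placeholder:

# A minimal verification sketch using the third-party happybase client
# (pip install happybase); assumes an HBase Thrift server on localhost.
import happybase

connection = happybase.Connection('localhost')
table = connection.table('20news-data')

# Random access by row key; the key below is a placeholder message ID.
print(table.row(b'<message-id-placeholder>'))

# Scan the first ten rows, mirroring the HBase shell scan above.
for key, data in table.scan(limit=10):
    print(key)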
The following are the steps to load the 20news dataset into an HBase table using the bulkload feature:
1. Start the HBase Shell and create a table named 20news-bulk, again with a single column family:

hbase(main):001:0> create '20news-bulk','h'
2. Go to HADOOP_HOME and use the following command to generate the HBase bulkload datafiles in the 20news-bulk-source directory:

>bin/hadoop jar $HBASE_HOME/hbase-<VERSION>.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,h:from,h:group,h:subj,h:msg -Dimporttsv.bulk.output=20news-bulk-source 20news-bulk 20news-cleaned
3. List and verify the generated bulkload datafiles:

>bin/hadoop fs -ls 20news-bulk-source
......
drwxr-xr-x   - thilina supergroup          0 2012-11-27 10:06 /user/thilina/20news-bulk-source/h

>bin/hadoop fs -ls 20news-bulk-source/h
-rw-r--r--   1 thilina supergroup      19110 2012-11-27 10:06 /user/thilina/20news-bulk-source/h/4796511868534757870
4. Load the generated datafiles into the 20news-bulk table using the completebulkload command:

>bin/hadoop jar $HBASE_HOME/hbase-<VERSION>.jar completebulkload 20news-bulk-source 20news-bulk
......
12/11/27 10:10:00 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://127.0.0.1:9000/user/thilina/20news-bulk-source/h/4796511868534757870 first=<[email protected]> last=<stephens.736002130@ngis>
......
5. Go to HBASE_HOME and start the HBase Shell. Use the count and scan commands of the HBase Shell to verify the contents of the 20news-bulk table:

hbase(main):010:0> count '20news-bulk'
hbase(main):010:0> scan '20news-bulk', {LIMIT => 10}
The MailPreProcessor.py Python script extracts a selected set of data fields from the news board message and outputs them as a tab-separated dataset:

value = fromAddress + "\t" + newsgroup + "\t" + subject + "\t" + value
print '%s\t%s' % (messageID, value)
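Only the final output statement of the mapper is shown above. The following is a minimal sketch of how such a streaming mapper can be structured end to end; the header names parsed here, and the assumption that each map input holds a single message, are illustrative, so the actual MailPreProcessor.py script may differ:

#!/usr/bin/env python
# Illustrative sketch of a Hadoop Streaming mapper in the style of
# MailPreProcessor.py. The header parsing below is an assumption for
# illustration; it is not the book's exact script.
import re
import sys

fromAddress = newsgroup = subject = messageID = ''
body = []

for line in sys.stdin:
    line = line.strip()
    line = re.sub('\t', ' ', line)  # importtsv allows no stray tabs
    if line.startswith('From:'):
        fromAddress = line[len('From:'):].strip()
    elif line.startswith('Newsgroups:'):
        newsgroup = line[len('Newsgroups:'):].strip()
    elif line.startswith('Subject:'):
        subject = line[len('Subject:'):].strip()
    elif line.startswith('Message-ID:'):
        messageID = line[len('Message-ID:'):].strip()
    else:
        body.append(line)

# Emit one tab-separated record per message: the key, then four fields.
if messageID:
    value = fromAddress + "\t" + newsgroup + "\t" + subject + "\t" + ' '.join(body)
    print('%s\t%s' % (messageID, value))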
We import the tab-separated dataset generated by the Streaming MapReduce computation to HBase using the importtsv tool. The importtsv tool requires the data to have no tab characters other than the tab characters separating the data fields. Hence, we remove any tab characters in the input data using the following snippet of the Python script:

line = line.strip()
line = re.sub('\t', ' ', line)
The importtsv tool supports loading data into HBase directly using Put operations, as well as by generating the HBase internal HFiles. The following command loads the data into HBase directly using Put operations. Our generated dataset contains a key and four value fields. We specify the mapping from data fields to table columns using the -Dimporttsv.columns parameter. This mapping consists of listing the respective table column names in the order of the tab-separated data fields in the input dataset:

>bin/hadoop jar $HBASE_HOME/hbase-<VERSION>.jar importtsv -Dimporttsv.columns=<data field to table column mappings> <HBase table name> <HDFS input directory>
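To make the mapping concrete, the following Python sketch mimics how importtsv interprets one record of our dataset under the -Dimporttsv.columns=HBASE_ROW_KEY,h:from,h:group,h:subj,h:msg mapping. The sample record is fabricated for illustration, and this is a conceptual model rather than importtsv's actual implementation:

# Conceptual model of the -Dimporttsv.columns mapping (not the real
# importtsv code). The sample record below is fabricated.
columns = ['HBASE_ROW_KEY', 'h:from', 'h:group', 'h:subj', 'h:msg']
record = '<message-id>\tsomeone@example.org\tsci.electronics\tRe: a subject\tmessage text'

fields = record.split('\t')
row_key = fields[columns.index('HBASE_ROW_KEY')]
puts = dict((col, val) for col, val in zip(columns, fields)
            if col != 'HBASE_ROW_KEY')

print(row_key)  # used as the HBase row key
print(puts)     # written as Put operations to column family 'h'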
We can use the following command to generate HBase HFiles for the dataset. These HFiles can be directly loaded into HBase, without going through the HBase APIs, thereby reducing the amount of CPU and network resources needed:

>bin/hadoop jar $HBASE_HOME/hbase-<VERSION>.jar importtsv -Dimporttsv.columns=<data field to table column mappings> -Dimporttsv.bulk.output=<path for HFile output> <HBase table name> <HDFS input directory>
These generated HFiles can be loaded into HBase tables by simply moving the files to the right location. This is done by using the completebulkload command:

>bin/hadoop jar $HBASE_HOME/hbase-<VERSION>.jar completebulkload <path for HFiles> <table name>
You can use the importtsv tool with datasets that use other datafile separator characters as well, by specifying the -Dimporttsv.separator parameter. The following is an example of using a comma as the separator character to import a comma-separated dataset into an HBase table:

>bin/hadoop jar $HBASE_HOME/hbase-<VERSION>.jar importtsv '-Dimporttsv.separator=,' -Dimporttsv.columns=<data field to table column mappings> <HBase table name> <HDFS input directory>
Look out for Bad Lines in the MapReduce job console output or in the Hadoop monitoring console. One reason for bad lines is having unwanted delimiter characters in a record; this is why the preceding Python script removes any extra tabs in the message. The following is an example of the Bad Lines counter displayed in the job console:

12/11/27 00:38:10 INFO mapred.JobClient: ImportTsv
12/11/27 00:38:10 INFO mapred.JobClient: Bad Lines=2
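Because records counted as Bad Lines are skipped rather than imported, it can be worth checking the dataset before running importtsv. The following is a small pre-flight sketch added here for convenience, not part of the original recipe; the expected field count of five corresponds to our row key plus four value fields:

#!/usr/bin/env python
# badlines_check.py -- a pre-flight sketch (not part of the original
# recipe) that flags records importtsv would count as Bad Lines
# because they do not split into the expected number of fields.
import sys

EXPECTED_FIELDS = 5  # HBASE_ROW_KEY, h:from, h:group, h:subj, h:msg

bad = 0
for lineno, line in enumerate(sys.stdin, 1):
    if len(line.rstrip('\n').split('\t')) != EXPECTED_FIELDS:
        bad += 1
        sys.stderr.write('suspect record at line %d\n' % lineno)

print('Bad Lines=%d' % bad)

For example, the cleaned dataset can be piped through the check with >bin/hadoop fs -cat 20news-cleaned/part-* | python badlines_check.py.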
HBase supports storing multiple versions of column values for each record. When querying, HBase returns the latest version of the values, unless we specify a particular time period. This feature of HBase can be used to perform automatic de-duplication, by making sure that we use the same RowKey value for duplicate records. In our 20news example, we use the MessageID as the RowKey value for the records, thus ensuring that duplicate messages will appear as different versions of the same data record.
HBase allows us to configure the maximum and the minimum number of versions per column family. Setting the maximum number of versions to a low value will reduce the data usage by discarding the older versions. Refer to http://hbase.apache.org/book/schema.versions.html for more information on setting the maximum or minimum number of versions.
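The following Python sketch illustrates both behaviors: writing two records under the same row key produces two versions of one record instead of a duplicate, and the column family's max_versions setting controls how many versions are retained. The third-party happybase client, the local Thrift server, and the table name are all assumptions introduced for this illustration:

# versions_demo.py -- a sketch assuming a local HBase Thrift server and
# the third-party happybase package (pip install happybase).
import happybase

connection = happybase.Connection('localhost')

# Keep up to three versions of each cell in the 'h' column family.
connection.create_table('versions-demo', {'h': dict(max_versions=3)})
table = connection.table('versions-demo')

# Storing a "duplicate" message under the same row key creates a new
# version of the record instead of a second record.
table.put(b'<message-id-1>', {b'h:subj': b'first copy'})
table.put(b'<message-id-1>', {b'h:subj': b'duplicate copy'})

print(table.row(b'<message-id-1>'))  # returns only the latest version
print(table.cells(b'<message-id-1>', b'h:subj', versions=3))  # all retained versions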