Loading large datasets to an Apache HBase data store using importtsv and bulkload tools

The Apache HBase data store is very useful for storing large-scale data in a semi-structured manner, so that it can be used for further processing by Hadoop MapReduce programs or to provide random-access data storage for client applications. In this recipe, we are going to import a large text dataset into HBase using the importtsv and bulkload tools.

Getting ready

Install and deploy Hadoop MapReduce and HDFS. Export the HADOOP_HOME environment variable to point to your Hadoop installation root folder.

Install and deploy Apache HBase in the distributed mode. Refer to the Deploying HBase on a Hadoop cluster recipe in this chapter for more information. Export the HBASE_HOME environment variable to point to your HBase installation root folder.

Install Python on your Hadoop compute nodes, if Python is not already installed.
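
The following is a minimal sketch of the environment variable setup, assuming Hadoop and HBase are installed under the /opt directory; adjust the paths to match your actual installation locations:

>export HADOOP_HOME=/opt/hadoop
>export HBASE_HOME=/opt/hbase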

How to do it…

The following steps show you how to load the TSV-converted 20news dataset into an HBase table:

  1. Follow the Data extract, cleaning, and format conversion using Hadoop Streaming and Python recipe to preprocess the data for this recipe. We assume that the output of step 4 of that recipe, generated by the following command, is stored in an HDFS folder named 20news-cleaned:
    >bin/hadoop jar \
        ../contrib/streaming/hadoop-streaming-VERSION.jar \
        -input 20news-all \
        -output 20news-cleaned \
        -mapper MailPreProcessor.py \
        -file MailPreProcessor.py
    
  2. Go to HBASE_HOME and start the HBase Shell:
    >cd $HBASE_HOME
    >bin/hbase shell
    
  3. Create a table named 20news-data by executing the following command in the HBase Shell. Older versions of the importtsv command (used in the next step) can handle only a single column family. Hence, we use only a single column family when creating the HBase table:
    hbase(main):001:0> create '20news-data','h'
    
  4. Go to HADOOP_HOME and execute the following command to import the preprocessed data into the previously created HBase table (a quick sanity check of the TSV field order against this column mapping is sketched after these steps):
    > bin/hadoop jar \
      $HBASE_HOME/hbase-<VERSION>.jar importtsv \
      -Dimporttsv.columns=HBASE_ROW_KEY,h:from,h:group,h:subj,h:msg \
      20news-data 20news-cleaned
    
  5. Go to HBASE_HOME and start the HBase Shell. Use the count and scan commands of the HBase Shell to verify the contents of the table:
    hbase(main):010:0> count '20news-data'           
    12xxx row(s) in 0.0250 seconds
    
    hbase(main):010:0> scan '20news-data', {LIMIT => 10}
    ROW                                       COLUMN+CELL
    <[email protected] column=h:c1,
    timestamp=1354028803355, value= [email protected]
    (Chris Katopis)>
    <[email protected] column=h:c2,
    timestamp=1354028803355, value= sci.electronics
    ......
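
As mentioned in step 4, it is worth verifying that the order of the tab-separated fields in the preprocessed data matches the -Dimporttsv.columns mapping before running the import. The following is a minimal sketch of such a check, assuming the Streaming job wrote its output to a file named part-00000 inside the 20news-cleaned folder; it prints each field of the first record on its own line (key, from, group, subject, and message, in that order):

>bin/hadoop fs -cat 20news-cleaned/part-00000 | head -n 1 | tr '\t' '\n'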
    

The following are the steps to load the 20news dataset to an HBase table using the bulkload feature:

  1. Follow steps 1 to 3, but create the table with a different name:
    hbase(main):001:0> create '20news-bulk','h'
    
  2. Go to HADOOP_HOME. Use the following command to generate the HBase bulkload datafiles. The -Dimporttsv.bulk.output parameter specifies the HDFS directory in which the generated HFiles will be placed:
    >bin/hadoop jar \
      $HBASE_HOME/hbase-<VERSION>.jar importtsv \
      -Dimporttsv.columns=HBASE_ROW_KEY,h:from,h:group,h:subj,h:msg \
      -Dimporttsv.bulk.output=20news-bulk-source \
      20news-bulk 20news-cleaned
    
  3. List the files to verify that the bulkload datafiles are generated:
    >bin/hadoop fs -ls 20news-bulk-source
    ......
    drwxr-xr-x   - thilina supergroup          0 2012-11-27 10:06 /user/thilina/20news-bulk-source/h
    
    >bin/hadoop fs -ls 20news-bulk-source/h
    -rw-r--r--   1 thilina supergroup      19110 2012-11-27 10:06 /user/thilina/20news-bulk-source/h/4796511868534757870
    
  4. The following command loads the data to the HBase table by moving the output files to the correct location:
    >bin/hadoop jar $HBASE_HOME/hbase-<VERSION>.jar \
    completebulkload 20news-bulk-source 20news-bulk
    ......
    12/11/27 10:10:00 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://127.0.0.1:9000/user/thilina/20news-bulk-source/h/4796511868534757870 first=<[email protected]> last=<stephens.736002130@ngis>
    ......
    
  5. Go to HBASE_HOME and start the HBase Shell. Use the count and scan commands of the HBase Shell to verify the contents of the table (a MapReduce-based alternative to count is sketched after these steps):
    hbase(main):010:0> count '20news-bulk'
    hbase(main):010:0> scan '20news-bulk', {LIMIT => 10}
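
The count command of the HBase Shell scans the table from a single client and can be slow for large tables. The HBase JAR used in the previous steps also provides a rowcounter MapReduce job that counts the rows of a table in parallel; the following sketch runs it against the bulk-loaded table:

>bin/hadoop jar $HBASE_HOME/hbase-<VERSION>.jar rowcounter 20news-bulk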
    

How it works...

The MailPreProcessor.py Python script extracts a selected set of data fields from each news board message and outputs them as a tab-separated dataset:

value = fromAddress + "\t" + newsgroup + "\t" + subject + "\t" + value
print '%s\t%s' % (messageID, value)

We import the tab-separated dataset generated by the Streaming MapReduce computation to HBase using the importtsv tool. The importtsv tool requires the data to contain no tab characters other than the tab characters separating the data fields. Hence, we remove any tab characters in the input data using the following snippet of the Python script:

line = line.strip()
line = re.sub('\t', ' ', line)

The importtsv tool supports loading data into HBase directly using Put operations as well as by generating the HBase internal HFiles. The following command loads the data into HBase directly using Put operations. Our generated dataset contains a key and four fields in the values. We specify the mapping from data fields to table column names using the -Dimporttsv.columns parameter. This mapping consists of listing the respective table column names in the order of the tab-separated data fields in the input dataset:

>bin/hadoop jar \
  $HBASE_HOME/hbase-<VERSION>.jar importtsv \
  -Dimporttsv.columns=<data field to table column mappings> \
  <HBase table name> <HDFS input directory>

We can use the following command to generate HBase HFiles for the dataset. These HFiles can be directly loaded to HBase, without going through the HBase APIs, thereby reducing the amount of CPU and network resources needed.

>bin/hadoop jar \
  $HBASE_HOME/hbase-<VERSION>.jar importtsv \
  -Dimporttsv.columns=<data field to table column mappings> \
  -Dimporttsv.bulk.output=<path for HFile output> \
  <HBase table name> <HDFS input directory>

These generated HFiles can be loaded into HBase tables by simply moving the files to the right location. This is done by using the completebulkload command:

>bin/hadoop jar $HBASE_HOME/hbase-<VERSION>.jar \
completebulkload <path for hfiles> <table name>

There's more...

You can use the importtsv tool with datasets that use other field-separator characters as well by specifying the -Dimporttsv.separator parameter. The following is an example of using a comma as the separator character to import a comma-separated dataset into an HBase table:

>bin/hadoop jar \
  $HBASE_HOME/hbase-<VERSION>.jar importtsv \
  '-Dimporttsv.separator=,' \
  -Dimporttsv.columns=<data field to table column mappings> \
  <HBase table name> <HDFS input directory>
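
For example, the following hypothetical invocation (the ratings table, the r column family, and the ratings-csv input folder are made-up names used only for illustration) imports a comma-separated dataset whose lines contain a row key followed by two value fields:

>bin/hadoop jar \
  $HBASE_HOME/hbase-<VERSION>.jar importtsv \
  '-Dimporttsv.separator=,' \
  -Dimporttsv.columns=HBASE_ROW_KEY,r:user,r:rating \
  ratings ratings-csv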

Look out for Bad Lines in the MapReduce job console output or in the Hadoop monitoring console. One reason for bad lines is the presence of unwanted delimiter characters inside a record, which is why the preceding Python script removes any extra tabs from the message. The following is an example of the Bad Lines counter displayed in the job console:

12/11/27 00:38:10 INFO mapred.JobClient: ImportTsv
12/11/27 00:38:10 INFO mapred.JobClient: Bad Lines=2
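
A quick way to spot such records before importing is to check the number of fields per line in the generated dataset. The following is a minimal sketch, assuming each valid record of our dataset contains exactly five tab-separated fields (the row key plus the four values) and that the Streaming output file is named part-00000:

>bin/hadoop fs -cat 20news-cleaned/part-00000 | awk -F'\t' 'NF != 5' | head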

Data de-duplication using HBase

HBase supports storing multiple versions of column values for each record. When querying, HBase returns the latest version of the values, unless we specify a specific time period. This feature of HBase can be used to perform automatic de-duplication by making sure that duplicate records are stored under the same RowKey. In our 20news example, we use the MessageID as the RowKey for the records, which ensures that duplicate messages appear only as different versions of the same data record.
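
For example, the following HBase Shell sketch retrieves up to three stored versions of the h:msg column for a given message; replace <message-id> with an actual row key from the table:

hbase(main):011:0> get '20news-data', '<message-id>', {COLUMN => 'h:msg', VERSIONS => 3}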

HBase allows us to configure the maximum or the minimum number of versions per column family. Setting the maximum number of versions to a low value reduces data usage by discarding the older versions. Refer to http://hbase.apache.org/book/schema.versions.html for more information on setting the maximum or minimum number of versions.
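
For example, the following HBase Shell commands reduce the maximum number of versions kept for the h column family of the 20news-data table to one; in older HBase releases, the table has to be disabled before altering it and enabled again afterwards:

hbase(main):012:0> disable '20news-data'
hbase(main):013:0> alter '20news-data', {NAME => 'h', VERSIONS => 1}
hbase(main):014:0> enable '20news-data'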

See also

  • Installing HBase in Chapter 5, Hadoop Ecosystem.
  • Running MapReduce jobs on HBase (table input/output) in Chapter 5, Hadoop Ecosystem.
  • Deploying HBase on a Hadoop cluster.