Configuring Apache HBase as the backend data store for Apache Nutch

Apache Nutch integrates Apache Gora to support different backend data stores. In this recipe, we configure Apache HBase as the backend data store for Apache Nutch. Other data stores, such as relational databases and Cassandra, can be plugged in through Gora in a similar way.

Getting ready

Set the JAVA_HOME environment variable.

Install Apache Ant and add it to the PATH environment variable.
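
The two prerequisites above can be checked from a terminal before starting. The following is a minimal sketch, not part of the original recipe. The function name check_prereqs is hypothetical, and the check assumes that a usable JDK exposes an executable at $JAVA_HOME/bin/java.

```shell
#!/bin/sh
# Hypothetical helper: reports which prerequisites of this recipe are missing.
check_prereqs() {
    rc=0
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
        rc=1
    elif [ ! -x "$JAVA_HOME/bin/java" ]; then
        # Assumption: a usable JDK has an executable bin/java under JAVA_HOME.
        echo "JAVA_HOME does not contain bin/java: $JAVA_HOME"
        rc=1
    fi
    if ! command -v ant >/dev/null 2>&1; then
        echo "ant was not found on the PATH"
        rc=1
    fi
    return $rc
}

if check_prereqs; then
    echo "Prerequisites look fine"
fi
```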

How to do it...

The following steps show you how to configure Apache HBase in local mode as the backend data store in which Apache Nutch stores the crawled data.

  1. Download and install Apache HBase. Apache Nutch 2.1 and Apache Gora 0.2 recommend HBase 0.90.4 or a later version of the 0.90.x branch.
  2. Create two directories to store the HBase data and the ZooKeeper data. Add the following to the $HBASE_HOME/conf/hbase-site.xml file, replacing the values with the paths to the two directories:
    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>file:///u/software/hbase-0.90.6/hbase-data</value>
      </property>
      <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>file:///u/software/hbase-0.90.6/zookeeper-data</value>
      </property>
    </configuration>

    Refer to the Installing HBase recipe in Chapter 5, Hadoop Ecosystem, for more information on how to install HBase in local mode. Before proceeding, test your HBase installation using the HBase shell (step 6 of the Installing HBase recipe).

  3. If you have not already downloaded Apache Nutch for the earlier recipes in this chapter, download it from http://nutch.apache.org and extract it.
  4. Add the following to the $NUTCH_HOME/conf/nutch-site.xml file:
    <property>
     <name>storage.data.store.class</name>
     <value>org.apache.gora.hbase.store.HBaseStore</value>
     <description>Default class for storing data</description>
    </property>
  5. Uncomment the following dependency in the $NUTCH_HOME/ivy/ivy.xml file:
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
  6. Add the following to the $NUTCH_HOME/conf/gora.properties file to set HBase as the default Gora data store:
    gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
  7. Execute the following commands in the $NUTCH_HOME directory to build Apache Nutch with HBase as the backend data store:
    > ant clean
    > ant runtime
    
  8. Follow steps 4 to 17 of the Intra-domain web crawling using Apache Nutch recipe.
  9. Start the HBase shell and issue the following commands to view the fetched data:
    > bin/hbase shell
    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Type "exit<RETURN>" to leave the HBase Shell
    Version 0.90.6, r1295128, Wed Feb 29 14:29:21 UTC 2012
    hbase(main):001:0> list
    TABLE                                                                                               
    webpage                                                                                             
    1 row(s) in 0.4970 seconds
    
    hbase(main):002:0> count 'webpage'
    Current count: 1000, row: org.apache.bval:http/release-management.html                              
    Current count: 2000, row: org.apache.james:http/jspf/index.html                                     
    Current count: 3000, row: org.apache.sqoop:http/team-list.html                                      
    Current count: 4000, row: org.onesocialweb:http/                                                    
    4065 row(s) in 1.2870 seconds
    
    hbase(main):005:0> scan 'webpage',{STARTROW => 'org.apache.nutch:http/', LIMIT=>10}
    ROW                                   COLUMN+CELL                                                                                                 
     org.apache.nutch:http/               column=f:bas, timestamp=1350800142780, value=http://nutch.apache.org/                                       
     org.apache.nutch:http/               column=f:cnt, timestamp=1350800142780, value=<....
    ......
    10 row(s) in 0.5160 seconds
    
  10. Follow the steps in the Indexing and searching web documents using Apache Solr recipe to index and search the fetched data using Apache Solr.
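
Before running the build in step 7, it can help to confirm that the settings from steps 4 and 6 are in place. The following is a minimal sketch under a few assumptions. The function name check_hbase_store is hypothetical, the check only looks for the HBaseStore class name in the two files, and it does not detect a property that is present but commented out. It also does not inspect ivy.xml, because the gora-hbase dependency may sit inside a multi-line comment that a line-based search cannot interpret reliably.

```shell
#!/bin/sh
# Hypothetical helper: checks that Nutch is configured to use the HBase store.
# Usage: check_hbase_store /path/to/nutch
check_hbase_store() {
    nutch_home=$1
    store_class='org.apache.gora.hbase.store.HBaseStore'

    # Step 4: nutch-site.xml should mention the HBaseStore class.
    if ! grep -qF "$store_class" "$nutch_home/conf/nutch-site.xml" 2>/dev/null; then
        echo "nutch-site.xml does not reference $store_class"
        return 1
    fi

    # Step 6: gora.properties should set it as the default data store.
    if ! grep -q '^gora\.datastore\.default[[:space:]]*=[[:space:]]*org\.apache\.gora\.hbase\.store\.HBaseStore' \
            "$nutch_home/conf/gora.properties" 2>/dev/null; then
        echo "gora.properties does not set gora.datastore.default to $store_class"
        return 1
    fi

    echo "HBase store settings found"
}

# Run the check only when NUTCH_HOME is defined in the environment.
if [ -n "${NUTCH_HOME:-}" ]; then
    check_hbase_store "$NUTCH_HOME" || echo "Fix the configuration before running ant"
fi
```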

How it works...

The preceding steps configure and run Apache Nutch using Apache HBase as the storage backend. When configured this way, Nutch stores the fetched web page data and other metadata in HBase tables. In this recipe, we use a standalone HBase deployment. However, as shown in the Whole web crawling with Apache Nutch using a Hadoop/HBase cluster recipe of this chapter, Nutch can be used with a distributed HBase deployment as well. Using HBase as the backend data store improves the scalability and performance of Nutch crawling.
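
In the scan output of step 9, the row key org.apache.nutch:http/ holds the page http://nutch.apache.org/. This suggests that the row keys of the webpage table are built from the reversed host name, the protocol, and the path. The following is a minimal sketch of that mapping, inferred from the output shown in this recipe rather than taken from the Nutch source. The function name reverse_url is hypothetical, and the sketch ignores port numbers. It can be handy for building STARTROW values for scan commands.

```shell
#!/bin/sh
# Hypothetical helper: converts a URL into the reversed-host row key form
# seen in the scan output of this recipe. Ports are not handled.
reverse_url() {
    url=$1
    scheme=${url%%://*}          # for example: http
    rest=${url#*://}             # for example: nutch.apache.org/
    case $rest in
        */*) host=${rest%%/*}; path=/${rest#*/} ;;
        *)   host=$rest; path= ;;
    esac

    # Reverse the dot-separated labels of the host name.
    reversed=
    old_ifs=$IFS
    IFS=.
    for label in $host; do
        if [ -z "$reversed" ]; then
            reversed=$label
        else
            reversed=$label.$reversed
        fi
    done
    IFS=$old_ifs

    printf '%s:%s%s\n' "$reversed" "$scheme" "$path"
}

reverse_url "http://nutch.apache.org/"   # prints org.apache.nutch:http/
```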

See also

  • The Installing HBase recipe of Chapter 5, Hadoop Ecosystem, and the Deploying HBase on a Hadoop cluster recipe of this chapter.