Apache Nutch uses Apache Gora to support different backend data stores. In this recipe, we configure Apache HBase as the backend data store for Apache Nutch. Other data stores, such as relational databases and Cassandra, can be plugged in through Gora in the same way.
Set the JAVA_HOME environment variable.
Install Apache Ant and add it to the PATH environment variable.
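The two prerequisites above can be checked from the command line before starting. The following is a minimal sketch; the function name check_prereqs is hypothetical, and it assumes that a usable JDK exposes an executable at $JAVA_HOME/bin/java.

```shell
# Hypothetical helper: verify the prerequisites of this recipe.
# Returns 0 when JAVA_HOME points at a JDK and ant is on the PATH.
check_prereqs() {
  status=0
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    status=1
  elif [ ! -x "$JAVA_HOME/bin/java" ]; then
    # Assumption: a valid JAVA_HOME contains an executable bin/java.
    echo "JAVA_HOME does not contain bin/java: $JAVA_HOME" >&2
    status=1
  fi
  if ! command -v ant >/dev/null 2>&1; then
    echo "ant was not found on the PATH" >&2
    status=1
  fi
  return "$status"
}
```

Calling check_prereqs prints a message for each missing prerequisite and returns a non-zero status if any check fails.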
The following steps show you how to configure Apache HBase local mode as the backend data store for Apache Nutch to store the crawled data.
Add the following to the $HBASE_HOME/conf/hbase-site.xml file, replacing the values with the paths to the directories that hold the HBase data and the ZooKeeper data:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///u/software/hbase-0.90.6/hbase-data</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>file:///u/software/hbase-0.90.6/zookeeper-data</value>
  </property>
</configuration>
Refer to the Installing HBase recipe in Chapter 5, Hadoop Ecosystem, for more information on how to install HBase in local mode. Test your HBase installation using the HBase shell before proceeding (step 6 of the Installing HBase recipe).
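As an illustration of the HBase configuration step, the data directories and the matching configuration can be produced together. This is only a sketch: the function name make_hbase_site is hypothetical, the directory names hbase-data and zookeeper-data are taken from the example paths in this recipe, and the base directory is assumed to be an absolute path so that the values keep the file:/// form used above.

```shell
# Hypothetical helper: create the two data directories under an absolute
# base directory ($1) and print a matching hbase-site.xml to standard output.
make_hbase_site() {
  base=$1
  mkdir -p "$base/hbase-data" "$base/zookeeper-data" || return 1
  # file:// followed by an absolute path yields the file:/// form of the recipe.
  cat <<EOF
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file://$base/hbase-data</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>file://$base/zookeeper-data</value>
  </property>
</configuration>
EOF
}
```

The output can be reviewed first and then redirected to $HBASE_HOME/conf/hbase-site.xml. Writing to standard output keeps the function from overwriting an existing configuration by accident.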
Add the following to the $NUTCH_HOME/conf/nutch-site.xml file.
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
Ensure that the following dependency is enabled in the $NUTCH_HOME/ivy/ivy.xml file.
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
Edit the $NUTCH_HOME/conf/gora.properties file to set the HBase storage as the default Gora data store.
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
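The store class appears in both nutch-site.xml and gora.properties, so it can be worth confirming that the two files agree before building. The following sketch uses two hypothetical helper functions and plain text matching; it does not parse XML, so it assumes the class name sits in a single-line <value> element as in the example above.

```shell
# Hypothetical helper: print the value of gora.datastore.default from the
# properties file in $1 (the last assignment wins if there are several).
gora_default_store() {
  sed -n 's/^[[:space:]]*gora\.datastore\.default[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

# Hypothetical helper: succeed if the file in $1 contains the store class
# in $2 inside a single-line <value> element (fixed-string match).
nutch_site_uses_store() {
  grep -qF "<value>$2</value>" "$1"
}

# Example usage (paths as used in this recipe):
#   store=$(gora_default_store "$NUTCH_HOME/conf/gora.properties")
#   nutch_site_uses_store "$NUTCH_HOME/conf/nutch-site.xml" "$store" \
#     || echo "nutch-site.xml and gora.properties disagree" >&2
```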
Execute the following commands in $NUTCH_HOME to build Apache Nutch with HBase as the backend data storage.
> ant clean
> ant runtime
> bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.6, r1295128, Wed Feb 29 14:29:21 UTC 2012

hbase(main):001:0> list
TABLE
webpage
1 row(s) in 0.4970 seconds

hbase(main):002:0> count 'webpage'
Current count: 1000, row: org.apache.bval:http/release-management.html
Current count: 2000, row: org.apache.james:http/jspf/index.html
Current count: 3000, row: org.apache.sqoop:http/team-list.html
Current count: 4000, row: org.onesocialweb:http/
4065 row(s) in 1.2870 seconds

hbase(main):005:0> scan 'webpage',{STARTROW => 'org.apache.nutch:http/', LIMIT=>10}
ROW                      COLUMN+CELL
 org.apache.nutch:http/  column=f:bas, timestamp=1350800142780, value=http://nutch.apache.org/
 org.apache.nutch:http/  column=f:cnt, timestamp=1350800142780, value=<....
......
10 row(s) in 0.5160 seconds
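The row keys in the preceding output put the host name in reversed order, followed by the protocol and the path; for example, http://nutch.apache.org/ is stored under org.apache.nutch:http/. To build a STARTROW value for a given URL, the conversion can be sketched as follows. The function name reverse_url is hypothetical and the format is inferred from the sample output above. In particular, placing an explicit port after the protocol is an assumption, and user information in the URL is not handled.

```shell
# Hypothetical helper: convert a URL into the reversed-host row key format
# seen in the 'webpage' table output. The body runs in a subshell so the
# IFS change does not leak into the caller.
reverse_url() (
  url=$1
  scheme=${url%%://*}
  rest=${url#*://}
  hostport=${rest%%/*}
  case $rest in
    */*) path=/${rest#*/} ;;
    *)   path= ;;
  esac
  host=${hostport%%:*}
  case $hostport in
    *:*) port=:${hostport##*:} ;;   # assumption: the port follows the protocol
    *)   port= ;;
  esac
  reversed=
  IFS=.
  for part in $host; do             # split the host name on dots
    if [ -z "$reversed" ]; then
      reversed=$part
    else
      reversed=$part.$reversed      # prepend each label to reverse the order
    fi
  done
  printf '%s:%s%s%s\n' "$reversed" "$scheme" "$port" "$path"
)

reverse_url "http://nutch.apache.org/"   # prints org.apache.nutch:http/
```

The printed key can then be used as the STARTROW argument of a scan command in the HBase shell.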
The preceding steps configure and run Apache Nutch using Apache HBase as the storage backend. When configured in this way, Nutch stores the fetched web page data and other metadata in HBase tables. This recipe uses a standalone HBase deployment. However, as shown in the Whole web crawling with Apache Nutch using a Hadoop/HBase cluster recipe of this chapter, Nutch can be used with a distributed HBase deployment as well. Using HBase as the backend data store improves the scalability and performance of Nutch crawling.