Configuring Apache HBase as the backend data store for Apache Nutch

Apache Nutch integrates Apache Gora to add support for different backend data stores. In this recipe, we are going to configure Apache HBase as the backend data store for Apache Nutch. Similarly, it is possible to plug in other data stores, such as relational databases and Cassandra, through Gora.

Getting ready

Set the JAVA_HOME environment variable.

Install Apache Ant and add it to the PATH environment variable.
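As a minimal sketch of these two prerequisites, the following shell snippet exports the variables and reports whether the tools are visible. The install locations /usr/lib/jvm/default-java and /opt/apache-ant, and the ANT_HOME variable, are assumptions for illustration only; substitute the paths used on your machine.

```shell
# Hypothetical install locations; replace them with the paths on your machine.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/default-java}"
export ANT_HOME="${ANT_HOME:-/opt/apache-ant}"
export PATH="$ANT_HOME/bin:$PATH"

# Report whether the tools needed for the build are visible.
# This only checks the PATH; it does not install anything.
for tool in java ant; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool found at $(command -v "$tool")"
  else
    echo "$tool not found on PATH"
  fi
done
```

Adding these export lines to your shell profile keeps the settings across sessions.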

How to do it

The following steps show you how to configure Apache HBase local mode as the backend data store for Apache Nutch to store the crawled data.

Download and install Apache HBase. Apache Nutch 2.1 and Apache Gora 0.2 recommend HBase 0.90.4 or a later version of the 0.90.x branch.
Create two directories to store the HBase data and the ZooKeeper data. Add the following to the $HBASE_HOME/conf/hbase-site.xml file, replacing the values with the paths to the two directories:
```
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///u/software/hbase-0.90.6/hbase-data</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>file:///u/software/hbase-0.90.6/zookeeper-data</value>
  </property>
</configuration>
```
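The two directories referenced in this configuration can be created along the following lines. This is only a sketch: make_hbase_dirs is a hypothetical helper name, and the directory names hbase-data and zookeeper-data simply mirror the example values above.

```shell
# Create the HBase and ZooKeeper data directories under a base directory
# and print the values to paste into hbase-site.xml.
# Pass an absolute path so that the printed file:// URLs are well formed.
make_hbase_dirs() {
  base_dir="$1"
  mkdir -p "$base_dir/hbase-data" "$base_dir/zookeeper-data" || return 1
  echo "hbase.rootdir=file://$base_dir/hbase-data"
  echo "hbase.zookeeper.property.dataDir=file://$base_dir/zookeeper-data"
}

# Example call, using the base path from the XML above:
# make_hbase_dirs /u/software/hbase-0.90.6
```

The helper only prepares the directories; the property values still have to be edited into hbase-site.xml by hand.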
Refer to the Installing HBase recipe in Chapter 5, Hadoop Ecosystem, for more information on how to install HBase in local mode. Before proceeding, test your HBase installation using the HBase shell (step 6 of the Installing HBase recipe).
If you have not already downloaded Apache Nutch for the earlier recipes in this chapter, download it from http://nutch.apache.org and extract it.

Add the following to the $NUTCH_HOME/conf/nutch-site.xml file.

```
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
```

Uncomment the following in the $NUTCH_HOME/ivy/ivy.xml file.

```
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
```

Add the following to the $NUTCH_HOME/conf/gora.properties file to set the HBase storage as the default Gora data store.
```
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
```
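If you prefer to script this edit, a sketch such as the following appends the line only when no gora.datastore.default entry exists yet. The function name set_gora_default_store is hypothetical, and the sketch assumes the properties file ends with a newline.

```shell
# Append the default-store line to a gora.properties file unless a
# gora.datastore.default entry is already present. An existing entry is
# left untouched and should be reviewed by hand.
set_gora_default_store() {
  props_file="$1"
  store_class="org.apache.gora.hbase.store.HBaseStore"
  if grep -q '^gora\.datastore\.default=' "$props_file" 2>/dev/null; then
    echo "gora.datastore.default is already set in $props_file"
  else
    echo "gora.datastore.default=$store_class" >> "$props_file"
  fi
}

# Example call:
# set_gora_default_store "$NUTCH_HOME/conf/gora.properties"
```

Running the function twice leaves a single entry in the file, so it is safe to rerun.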
Execute the following commands in $NUTCH_HOME to build Apache Nutch with HBase as the backend data store.
```
> ant clean
> ant runtime
```
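A quick way to confirm that the build produced a usable runtime is sketched below. It assumes that ant runtime places the launcher script at runtime/local/bin/nutch under $NUTCH_HOME; nutch_runtime_built is a hypothetical helper name.

```shell
# Return 0 if the given Nutch directory contains a built local runtime.
# Assumption: `ant runtime` places the launcher at runtime/local/bin/nutch.
nutch_runtime_built() {
  nutch_home="$1"
  [ -f "$nutch_home/runtime/local/bin/nutch" ]
}

# Example call:
# nutch_runtime_built "$NUTCH_HOME" && echo "Nutch runtime is ready"
```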
Follow steps 4 to 17 of the Intra-domain web crawling using Apache Nutch recipe.

Start the HBase shell and issue the following commands to view the fetched data.

```
> bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.6, r1295128, Wed Feb 29 14:29:21 UTC 2012
hbase(main):001:0> list
TABLE
webpage
1 row(s) in 0.4970 seconds

hbase(main):002:0> count 'webpage'
Current count: 1000, row: org.apache.bval:http/release-management.html
Current count: 2000, row: org.apache.james:http/jspf/index.html
Current count: 3000, row: org.apache.sqoop:http/team-list.html
Current count: 4000, row: org.onesocialweb:http/
4065 row(s) in 1.2870 seconds

hbase(main):005:0> scan 'webpage',{STARTROW => 'org.apache.nutch:http/', LIMIT=>10}
ROW                                   COLUMN+CELL
 org.apache.nutch:http/               column=f:bas, timestamp=1350800142780, value=http://nutch.apache.org/
 org.apache.nutch:http/               column=f:cnt, timestamp=1350800142780, value=<....
......
10 row(s) in 0.5160 seconds
```

Follow the steps in the Indexing and searching web documents using Apache Solr recipe to index and search the fetched data using Apache Solr.

How it works...

The preceding steps configure and run Apache Nutch with Apache HBase as the storage backend. Once configured, Nutch stores the fetched web page data and its metadata in HBase tables, such as the webpage table listed in the HBase shell. This recipe uses a standalone HBase deployment. However, as shown in the Whole web crawling with Apache Nutch using a Hadoop/HBase cluster recipe of this chapter, Nutch can also be used with a distributed HBase deployment. Using HBase as the backend data store improves the scalability and performance of Nutch crawling.
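The row keys in the HBase shell output, such as org.apache.nutch:http/ for http://nutch.apache.org/, appear to consist of the host name with its labels reversed, followed by a colon, the protocol, and the path. The following sketch reproduces that pattern so you can compute a STARTROW value for a scan. It is an illustration inferred from the output shown in this recipe, not Nutch's own implementation; nutch_row_key is a hypothetical helper, and URLs with explicit ports are not handled.

```shell
# Build a row key shaped like those in the scan output:
# reversed host labels, ':', protocol, then the path.
# Sketch only: URLs with an explicit port are not handled.
nutch_row_key() {
  url="$1"
  scheme="${url%%://*}"   # e.g. http
  rest="${url#*://}"      # e.g. nutch.apache.org/
  host="${rest%%/*}"      # e.g. nutch.apache.org
  case "$rest" in
    */*) path="/${rest#*/}" ;;
    *)   path="" ;;
  esac

  # Reverse the dot-separated host labels one at a time.
  reversed=""
  remaining="$host"
  while [ -n "$remaining" ]; do
    label="${remaining%%.*}"
    if [ "$remaining" = "$label" ]; then
      remaining=""
    else
      remaining="${remaining#*.}"
    fi
    reversed="${label}${reversed:+.$reversed}"
  done

  printf '%s:%s%s\n' "$reversed" "$scheme" "$path"
}

# Example call:
# nutch_row_key 'http://nutch.apache.org/'
```

Because the reversed host name comes first, pages from the same domain sort next to each other in the table, which is why a scan with a STARTROW of org.apache.nutch:http/ begins with the pages of that host.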