Whole web crawling with Apache Nutch using a Hadoop/HBase cluster

Crawling a large number of web documents can be done efficiently by utilizing the power of a MapReduce cluster.

Getting ready

We assume you already have your Hadoop (version 1.0.x) and HBase (version 0.90.x) cluster deployed. If not, refer to the Deploying HBase on a Hadoop cluster recipe of this chapter to configure and deploy an HBase cluster on a Hadoop cluster.

How to do it

The following steps show you how to use Apache Nutch with a Hadoop MapReduce cluster and an HBase data store to perform large-scale web crawling.

  1. Add the $HADOOP_HOME/bin directory to the PATH environment variable of your machine.
    > export PATH=$PATH:$HADOOP_HOME/bin/
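
    You can verify that the hadoop command is now picked up from your cluster installation by running, for example:
    > hadoop version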
    
  2. If you have already followed the Indexing and searching web documents using Apache Solr recipe, skip to the next step. If not, follow steps 2 to 6 of that recipe.
  3. In case you have not downloaded Apache Nutch for the earlier recipes in this chapter, download Nutch from http://nutch.apache.org and extract it.
  4. Add the following to the nutch-site.xml file in the $NUTCH_HOME/conf directory. You can give any name as the value of the http.agent.name property, but the same name should be given in the http.robots.agents property as well.
    <configuration>
    <property>
     <name>storage.data.store.class</name>
     <value>org.apache.gora.hbase.store.HBaseStore</value>
     <description>Default class for storing data</description>
    </property>
    <property>
     <name>http.agent.name</name>
     <value>NutchCrawler</value>
    </property>
    <property>
      <name>http.robots.agents</name>
      <value>NutchCrawler,*</value>
    </property>
    </configuration>
  5. Uncomment the following in the $NUTCH_HOME/ivy/ivy.xml file:
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
  6. Add the following to the $NUTCH_HOME/conf/gora.properties file to set the HBase storage as the default Gora data store:
    gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

    Note

    You can restrict the domain names you wish to crawl by editing the following line in the conf/regex-urlfilter.txt file. Leave it unchanged for whole web crawling.

    # accept anything else
    +.
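
    For example, to restrict the crawl to a single domain (example.org here is just a placeholder), you could replace that line with a rule similar to the following:

    # accept only URLs under example.org
    +^http://([a-z0-9-]+\.)*example\.org/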
  7. Execute the following commands in $NUTCH_HOME to build Nutch with HBase as the backend data storage:
    > ant clean
    > ant runtime
    
  8. Create a directory in HDFS to upload the seed URLs:
    > bin/hadoop dfs -mkdir urls
    
  9. Create a text file with the seed URLs for the crawl, one URL per line (a minimal example is given after the following note). Upload the seed URL file to the directory created in the previous step:
    > bin/hadoop dfs -put seed.txt urls
    

    Note

    You can use the Open Directory project RDF dump (http://rdf.dmoz.org/) to create your seed URLs. Nutch provides a utility class to select a subset of URLs from the extracted DMOZ RDF data:

    bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
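
    The seed file itself is simply a plain-text list of URLs, one per line. A minimal hypothetical seed.txt could look like the following:

    http://nutch.apache.org/
    http://hadoop.apache.org/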

  10. Copy the $NUTCH_HOME/runtime/deploy directory to the JobTracker node of the Hadoop cluster.
  11. Issue the following commands from inside the copied deploy directory on the JobTracker node to inject the seed URLs into the Nutch database and to generate the initial fetch list:
    > bin/nutch inject urls
    > bin/nutch generate
    
  12. Issue the following commands from inside the copied deploy directory on the JobTracker node:
    > bin/nutch fetch -all
    12/10/22 03:56:39 INFO fetcher.FetcherJob: FetcherJob: starting
    12/10/22 03:56:39 INFO fetcher.FetcherJob: FetcherJob: fetching all
    ......
    
    > bin/nutch parse -all
    12/10/22 03:48:51 INFO parse.ParserJob: ParserJob: starting
    ......
    
    12/10/22 03:50:44 INFO parse.ParserJob: ParserJob: success
    
    > bin/nutch updatedb
    12/10/22 03:53:10 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting
    ....
    12/10/22 03:53:50 INFO crawl.DbUpdaterJob: DbUpdaterJob: done
    
    > bin/nutch generate -topN 10
    12/10/22 03:51:09 INFO crawl.GeneratorJob: GeneratorJob: Selecting best-scoring urls due for fetch.
    12/10/22 03:51:09 INFO crawl.GeneratorJob: GeneratorJob: starting
    ....
    12/10/22 03:51:46 INFO crawl.GeneratorJob: GeneratorJob: done
    12/10/22 03:51:46 INFO crawl.GeneratorJob: GeneratorJob: generated batch id: 1350892269-603479705
    
  13. Repeat the commands in step 12 as many times as needed to crawl the desired number of pages or to reach the desired depth. The note below sketches one way to script this cycle.
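
    Note

    If you want to script the crawl cycle from step 12, a minimal shell sketch is as follows (the iteration count of 3 is an arbitrary assumption; adjust it to the number of crawl rounds you need):

    for i in 1 2 3
    do
      bin/nutch fetch -all
      bin/nutch parse -all
      bin/nutch updatedb
      bin/nutch generate -topN 10
    done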
  14. Follow the Indexing and searching fetched web documents using Apache Solr recipe to index the fetched data using Apache Solr.

How it works

All the Nutch operations we used in this recipe, including fetching and parsing, are implemented as MapReduce programs. These MapReduce programs utilize the Hadoop cluster to perform the Nutch operations in a distributed manner and use HBase, which in turn stores its data in HDFS, to persist the crawl data across the cluster. You can monitor these MapReduce computations through the monitoring UI (http://jobtracker_ip:50030) of your Hadoop cluster.

The Apache Nutch Ant build creates a Hadoop job file containing all the dependencies in the $NUTCH_HOME/runtime/deploy folder. The bin/nutch script uses this job file to submit the MapReduce computations to Hadoop.
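
For example, running bin/nutch fetch -all from the deploy directory results in a job submission that is roughly equivalent to the following (the exact name of the .job file depends on the Nutch version you built, so treat this only as an illustrative sketch):

> hadoop jar apache-nutch-2.1.job org.apache.nutch.fetcher.FetcherJob -all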

See also

The Intra-domain crawling using Apache Nutch recipe of this chapter.
