Crawling a large number of web documents can be done efficiently by utilizing the power of a MapReduce cluster.
We assume you already have your Hadoop (version 1.0.x) and HBase (version 0.90.x) cluster deployed. If not, refer to the Deploying HBase on a Hadoop cluster recipe of this chapter to configure and deploy an HBase cluster on a Hadoop cluster.
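Before continuing, it is worth verifying that both clusters are running. The following optional sanity checks are only a sketch, assuming $HADOOP_HOME and $HBASE_HOME point to your Hadoop and HBase installations:

# Report the status of HDFS and its DataNodes
> $HADOOP_HOME/bin/hadoop dfsadmin -report
# Print the HBase cluster status (number of region servers, and so on)
> echo "status" | $HBASE_HOME/bin/hbase shell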
The following steps show you how to use Apache Nutch with a Hadoop MapReduce cluster and an HBase data store to perform large-scale web crawling.
1. Add the $HADOOP_HOME/bin directory to the PATH environment variable of your machine:

> export PATH=$PATH:$HADOOP_HOME/bin/
2. Create a nutch-site.xml file in the $NUTCH_HOME/conf directory containing the following. You can give any name as the value of the http.agent.name property, but that name should be given in the http.robots.agents property as well:

<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>NutchCrawler</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>NutchCrawler,*</value>
  </property>
</configuration>
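Depending on your setup, the Gora HBase backend also needs to locate your HBase cluster (in particular, the ZooKeeper quorum) through an hbase-site.xml file on its classpath. One common way to provide this, assuming $HBASE_HOME points to your HBase installation, is to copy the HBase client configuration into the Nutch conf directory:

# Make the HBase client configuration visible to Nutch/Gora
> cp $HBASE_HOME/conf/hbase-site.xml $NUTCH_HOME/conf/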
3. Uncomment the following dependency in the $NUTCH_HOME/ivy/ivy.xml file:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
4. Modify the $NUTCH_HOME/conf/gora.properties file to set the HBase storage as the default Gora data store:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
5. Execute the following commands in the $NUTCH_HOME directory to build Nutch with HBase as the backend data storage:

> ant clean
> ant runtime
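The runtime target creates, among other artifacts, a self-contained Hadoop job file under $NUTCH_HOME/runtime/deploy. You can verify the build with a listing such as the following; the exact .job filename depends on your Nutch version:

> ls $NUTCH_HOME/runtime/deploy/*.job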
6. Create a directory in HDFS and upload a seed.txt file containing your seed URLs:

> bin/hadoop dfs -mkdir urls
> bin/hadoop dfs -put seed.txt urls
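If you do not have a seed file yet, a minimal seed.txt simply lists one URL per line. The following is only an example to create one before running the -put command above; the URLs are placeholders you should replace with your own:

> cat > seed.txt << 'EOF'
http://apache.org/
http://hadoop.apache.org/
EOF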
You can use the Open Directory project RDF dump (http://rdf.dmoz.org/) to create your seed URLs. Nutch provides a utility class to select a subset of URLs from the extracted DMOZ RDF data:
> bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
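The extracted URL list can then be uploaded to the urls directory in HDFS created in the previous step, for example:

> bin/hadoop dfs -put dmoz/urls urls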
7. Copy the $NUTCH_HOME/runtime/deploy directory to the JobTracker node of the Hadoop cluster.

8. Issue the following commands from inside the copied deploy directory to inject the seed URLs into the Nutch database and to generate the initial fetch list:

> bin/nutch inject urls
> bin/nutch generate
9. Issue the following commands to fetch and parse the crawled web pages, update the Nutch database with the results, and generate a new fetch list of the top-scoring URLs:

> bin/nutch fetch -all
12/10/22 03:56:39 INFO fetcher.FetcherJob: FetcherJob: starting
12/10/22 03:56:39 INFO fetcher.FetcherJob: FetcherJob: fetching all
......

> bin/nutch parse -all
12/10/22 03:48:51 INFO parse.ParserJob: ParserJob: starting
......
12/10/22 03:50:44 INFO parse.ParserJob: ParserJob: success

> bin/nutch updatedb
12/10/22 03:53:10 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting
....
12/10/22 03:53:50 INFO crawl.DbUpdaterJob: DbUpdaterJob: done

> bin/nutch generate -topN 10
12/10/22 03:51:09 INFO crawl.GeneratorJob: GeneratorJob: Selecting best-scoring urls due for fetch.
12/10/22 03:51:09 INFO crawl.GeneratorJob: GeneratorJob: starting
....
12/10/22 03:51:46 INFO crawl.GeneratorJob: GeneratorJob: done
12/10/22 03:51:46 INFO crawl.GeneratorJob: GeneratorJob: generated batch id: 1350892269-603479705
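Repeating the fetch, parse, updatedb, and generate commands crawls one level deeper in each round. The following is a minimal sketch of such a crawl loop, assuming it is run from inside the copied deploy directory; the iteration count and the -topN value are example numbers to tune for your crawl:

#!/bin/bash
# Example crawl loop (illustrative values, not part of the original recipe)
for i in 1 2 3; do
  bin/nutch fetch -all           # fetch all pending generated batches
  bin/nutch parse -all           # parse the fetched content
  bin/nutch updatedb             # merge new outlinks and scores into the database
  bin/nutch generate -topN 1000  # select the best-scoring URLs for the next round
done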
All the Nutch operations we used in this recipe, including fetching and parsing, are implemented as MapReduce programs. These MapReduce programs utilize the Hadoop cluster to perform the Nutch operations in a distributed manner and use HBase to store the crawl data on top of HDFS. You can monitor these MapReduce computations through the monitoring UI (http://jobtracker_ip:50030) of your Hadoop cluster.
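The same computations can also be tracked from the command line. For example, the standard Hadoop 1.x job client lists the currently running jobs:

> bin/hadoop job -list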
The Apache Nutch Ant build creates a Hadoop job file containing all the dependencies in the $NUTCH_HOME/runtime/deploy folder. The bin/nutch script uses this job file to submit the MapReduce computations to Hadoop.
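In other words, each bin/nutch command is essentially a convenience wrapper around a plain hadoop jar submission of that job file. The following sketch illustrates the idea for the fetch operation; the .job filename is an example (check your runtime/deploy directory for the actual name), and the class name is the FetcherJob visible in the log output shown earlier:

# Illustrative only: the job filename depends on your Nutch version
> bin/hadoop jar apache-nutch-2.1.job org.apache.nutch.fetcher.FetcherJob -all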