Whole web crawling with Apache Nutch using a Hadoop/HBase cluster

Crawling a large number of web documents can be done efficiently by utilizing the power of a MapReduce cluster.

Getting ready

We assume you already have your Hadoop (version 1.0.x) and HBase (version 0.90.x) cluster deployed. If not, refer to the Deploying HBase on a Hadoop cluster recipe of this chapter to configure and deploy an HBase cluster on a Hadoop cluster.

How to do it

The following steps show you how to use Apache Nutch with a Hadoop MapReduce cluster and an HBase data store to perform large-scale web crawling.

  1. Add the $HADOOP_HOME/bin directory to the PATH environment variable of your machine.
    > export PATH=$PATH:$HADOOP_HOME/bin/
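
    You can verify that the hadoop command is now picked up from your cluster installation by running, for example:
    > hadoop version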
    
  2. If you have already followed the Indexing and searching web documents using Apache Solr recipe, skip to the next step. If not, follow steps 2 to 6 of that recipe.
  3. In case you have not downloaded Apache Nutch for the earlier recipes in this chapter, download Nutch from http://nutch.apache.org and extract it.
  4. Add the following to the nutch-site.xml file in the $NUTCH_HOME/conf directory. You can give any name as the value of the http.agent.name property, but the same name should be given in the http.robots.agents property as well.
    <configuration>
    <property>
     <name>storage.data.store.class</name>
     <value>org.apache.gora.hbase.store.HBaseStore</value>
     <description>Default class for storing data</description>
    </property>
    <property>
     <name>http.agent.name</name>
     <value>NutchCrawler</value>
    </property>
    <property>
      <name>http.robots.agents</name>
      <value>NutchCrawler,*</value>
    </property>
    </configuration>
  5. Uncomment the following in the $NUTCH_HOME/ivy/ivy.xml file:
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
  6. Add the following to the $NUTCH_HOME/conf/gora.properties file to set the HBase storage as the default Gora data store:
    gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

    Note

    You can restrict the domain names you wish to crawl by editing the following line in the conf/regex-urlfilter.txt file. Leave it unchanged for whole web crawling.

    # accept anything else
    +.
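
    For example, to restrict the crawl to a single domain (example.org here is just a placeholder), you could replace that line with a rule similar to the following:

    # accept only URLs under example.org
    +^http://([a-z0-9-]+\.)*example\.org/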
  7. Execute the following commands in $NUTCH_HOME to build Nutch with HBase as the backend data storage:
    > ant clean
    > ant runtime
    
  8. Create a directory in HDFS to upload the seed URLs:
    > bin/hadoop dfs -mkdir urls
    
  9. Create a text file with the seed URLs for the crawl, one URL per line (a minimal example is given after the following note). Upload the seed URL file to the directory created in the previous step:
    > bin/hadoop dfs -put seed.txt urls
    

    Note

    You can use the Open Directory project RDF dump (http://rdf.dmoz.org/) to create your seed URLs. Nutch provides a utility class to select a subset of URLs from the extracted DMOZ RDF data:

    bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
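
    The seed file itself is simply a plain-text list of URLs, one per line. A minimal hypothetical seed.txt could look like the following:

    http://nutch.apache.org/
    http://hadoop.apache.org/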

  10. Copy the $NUTCH_HOME/runtime/deploy directory to the JobTracker node of the Hadoop cluster.
  11. Issue the following commands from inside the copied deploy directory on the JobTracker node to inject the seed URLs into the Nutch database and to generate the initial fetch list:
    > bin/nutch inject urls
    > bin/nutch generate
    
  12. Issue the following commands from inside the copied deploy directory on the JobTracker node:
    > bin/nutch fetch -all
    12/10/22 03:56:39 INFO fetcher.FetcherJob: FetcherJob: starting
    12/10/22 03:56:39 INFO fetcher.FetcherJob: FetcherJob: fetching all
    ......
    
    > bin/nutch parse -all
    12/10/22 03:48:51 INFO parse.ParserJob: ParserJob: starting
    ......
    
    12/10/22 03:50:44 INFO parse.ParserJob: ParserJob: success
    
    > bin/nutch updatedb
    12/10/22 03:53:10 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting
    ....
    12/10/22 03:53:50 INFO crawl.DbUpdaterJob: DbUpdaterJob: done
    
    > bin/nutch generate -topN 10
    12/10/22 03:51:09 INFO crawl.GeneratorJob: GeneratorJob: Selecting best-scoring urls due for fetch.
    12/10/22 03:51:09 INFO crawl.GeneratorJob: GeneratorJob: starting
    ....
    12/10/22 03:51:46 INFO crawl.GeneratorJob: GeneratorJob: done
    12/10/22 03:51:46 INFO crawl.GeneratorJob: GeneratorJob: generated batch id: 1350892269-603479705
    
  13. Repeat the commands in step 12 as many times as needed to crawl the desired number of pages or to reach the desired depth. The note below sketches one way to script this cycle.
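
    Note

    If you want to script the crawl cycle from step 12, a minimal shell sketch is as follows (the iteration count of 3 is an arbitrary assumption; adjust it to the number of crawl rounds you need):

    for i in 1 2 3
    do
      bin/nutch fetch -all
      bin/nutch parse -all
      bin/nutch updatedb
      bin/nutch generate -topN 10
    done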
  14. Follow the Indexing and searching fetched web documents using Apache Solr recipe to index the fetched data using Apache Solr.

How it works

All the Nutch operations we used in this recipe, including fetching and parsing, are implemented as MapReduce programs. These MapReduce programs utilize the Hadoop cluster to perform the Nutch operations in a distributed manner and use HBase, which in turn stores its data in HDFS, to persist the crawl data across the cluster. You can monitor these MapReduce computations through the monitoring UI (http://jobtracker_ip:50030) of your Hadoop cluster.

The Apache Nutch Ant build creates a Hadoop job file containing all the dependencies in the $NUTCH_HOME/runtime/deploy folder. The bin/nutch script uses this job file to submit the MapReduce computations to Hadoop.
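
For example, running bin/nutch fetch -all from the deploy directory results in a job submission that is roughly equivalent to the following (the exact name of the .job file depends on the Nutch version you built, so treat this only as an illustrative sketch):

> hadoop jar apache-nutch-2.1.job org.apache.nutch.fetcher.FetcherJob -all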

See also

The Intra-domain crawling using Apache Nutch recipe of this chapter.
