Intra-domain web crawling using Apache Nutch

Web crawling is the process of visiting and downloading all or a subset of the web pages on the Internet. Although the concept of crawling is simple, and implementing a basic crawler is relatively straightforward, building a full-fledged crawler takes a great deal of work. A full-fledged, distributed crawler has to follow best practices such as not overloading servers, obeying robots.txt, performing periodic crawls, prioritizing the pages to crawl, and identifying many document formats. Apache Nutch is an open source search engine that provides a highly scalable crawler, offering features such as politeness, robustness, and scalability.

In this recipe, we are going to use Apache Nutch in the standalone mode for small-scale, intra-domain web crawling. Almost all the Nutch commands are implemented as Hadoop MapReduce applications, as you will notice when executing steps 10 to 18 of this recipe. Nutch standalone executes these applications using the Hadoop local mode.

Getting ready

Set the JAVA_HOME environment variable. Install Apache Ant and add it to the PATH environment variable.
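
For example, on a Linux system the setup might look similar to the following. The install paths shown here are hypothetical; replace them with the actual locations of your JDK and Apache Ant installations:

    > export JAVA_HOME=/usr/lib/jvm/java-7-oracle
    > export ANT_HOME=/opt/apache-ant-1.9.4
    > export PATH=$ANT_HOME/bin:$PATH
    > ant -version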

How to do it...

The following steps show you how to use Apache Nutch in standalone mode for small-scale web crawling.

  1. Apache Nutch standalone mode uses the HyperSQL database as the default data storage. Download HyperSQL from http://sourceforge.net/projects/hsqldb/. Unzip the distribution and go to the hsqldb directory:
    > cd hsqldb-2.2.9/hsqldb
    
  2. Start a HyperSQL database using the following command. The database uses data/nutchdb.* as the database files and nutchdb as the database alias name. We will use this database alias name in the gora.sqlstore.jdbc.url property in step 8.
    > java -cp lib/hsqldb.jar org.hsqldb.server.Server --database.0 file:data/nutchdb --dbname.0 nutchdb
    ......
    [Server@79616c7]: Database [index=0, id=0, db=file:data/nutchdb, alias=nutchdb] opened successfully in 523 ms.
    ......
    
  3. Download Apache Nutch 2.X from http://nutch.apache.org/ and extract it.
  4. Go to the extracted directory, which we will refer to as NUTCH_HOME, and build Apache Nutch using the following command:
    > ant runtime
    
  5. Go to the runtime/local directory and run the bin/nutch command to verify the Nutch installation. A successful installation would print out the list of Nutch commands, shown as follows:
    > cd runtime/local
    > bin/nutch
    Usage: nutch COMMAND
    where COMMAND is one of:…..
    
  6. Add the following to the NUTCH_HOME/runtime/local/conf/nutch-site.xml file. You can use any name as the value of http.agent.name.
    <configuration>
    <property>
     <name>http.agent.name</name>
     <value>NutchCrawler</value>
    </property>
    <property>
      <name>http.robots.agents</name>
      <value>NutchCrawler,*</value>
    </property>
    </configuration>
  7. You can restrict the domain names you wish to crawl by editing the NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt file. For example, in order to restrict the crawl to the http://apache.org domain:

    Replace the following in the NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt file:

    # accept anything else
    +.

    Use the following regular expression:

    +^http://([a-z0-9]*\.)*apache.org/
  8. Ensure that you have the following in the NUTCH_HOME/runtime/local/conf/gora.properties file. Provide the database alias name used in step 2.
    ###############################
    # Default SqlStore properties #
    ###############################
    gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
    gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchdb
    gora.sqlstore.jdbc.user=sa
  9. Create a directory named urls and create a file named seed.txt inside that directory. Add your seed URLs to this file. Seed URLs are used to start the crawl and are the first pages to be fetched. We use http://apache.org as the seed URL in the following example:
    > mkdir urls
    > echo http://apache.org/ > urls/seed.txt
    
  10. Inject the seed URLs into the Nutch database using the following command:
    > bin/nutch inject urls/
    InjectorJob: starting
    InjectorJob: urlDir: urls
    InjectorJob: finished
    
  11. Use the following command to verify the injection of the seeds into the Nutch database. The TOTAL urls value printed by this command should match the number of URLs in your seed.txt file. You can use this command in later cycles as well to get an idea of the number of web page entries in your database.
    > bin/nutch readdb  -stats
    WebTable statistics start
    Statistics for WebTable:
    min score:  1.0
    ....
    TOTAL urls:  1
    
  12. Use the following command to generate a fetch list from the injected seed URLs. This prepares a list of web pages to be fetched in the first cycle of the crawl. Generation assigns a batch ID to the generated fetch list, which can be used in subsequent commands.
    > bin/nutch generate
    GeneratorJob: Selecting best-scoring urls due for fetch.
    GeneratorJob: starting
    GeneratorJob: filtering: true
    GeneratorJob: done
    GeneratorJob: generated batch id: 1350617353-1356796157
    
  13. Use the following command to fetch the list of pages prepared in step 12. This step performs the actual fetching of the web pages. The -all parameter instructs Nutch to fetch all the generated batches.
    > bin/nutch fetch -all
    FetcherJob: starting
    FetcherJob: fetching all
    FetcherJob: threads: 10
    ......
    fetching http://apache.org/
    ......
    -activeThreads=0
    FetcherJob: done
    
  14. Use the following command to parse the fetched web pages and to extract the useful data from them, such as the text content of the pages, the metadata of the pages, and the set of pages linked from the fetched pages. The set of pages linked from a fetched page is called the out-links of that page. The out-links data will be used to discover new pages to fetch as well as to rank pages using link analysis algorithms such as PageRank.
    > bin/nutch parse -all
    ParserJob: starting
    ......
    ParserJob: success
    
  15. Execute the following command to update the Nutch database with the data extracted in the preceding step. This step updates the contents of the fetched pages and adds new entries for the pages discovered through the links contained in the fetched pages.
    > bin/nutch updatedb
    DbUpdaterJob: starting
    ......
    DbUpdaterJob: done
    
  16. Execute the following command to generate a new fetch list using the information from the previously fetched data. The -topN parameter limits the number of URLs generated for the next fetch cycle.
    > bin/nutch generate -topN 100
    GeneratorJob: Selecting best-scoring urls due for fetch.
    GeneratorJob: starting
    ......
    GeneratorJob: done
    GeneratorJob: generated batch id: 1350618261-1660124671
    
  17. Fetch the new list, parse it, and update the database.
    > bin/nutch fetch -all
    ......
    > bin/nutch parse -all
    ......
    > bin/nutch updatedb
    ......
    
  18. Repeat steps 16 and 17 until you have crawled the desired number of pages or reached the desired link depth. A simple shell loop that automates these crawl cycles is sketched after this list.
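
The following is a minimal Bash sketch of such an automated crawl loop, run from the NUTCH_HOME/runtime/local directory. The iteration count (three cycles) and the -topN limit of 100 are arbitrary choices for illustration; adjust them to suit the size of the crawl you need.

    #!/bin/bash
    # Run repeated generate-fetch-parse-update crawl cycles.
    # Assumes this script is executed from NUTCH_HOME/runtime/local.
    for i in 1 2 3; do
      bin/nutch generate -topN 100   # prepare the next fetch list
      bin/nutch fetch -all           # fetch all generated batches
      bin/nutch parse -all           # parse the fetched pages
      bin/nutch updatedb             # update the Nutch database
    done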

See also

The Whole web crawling with Apache Nutch using a Hadoop/HBase cluster and the Indexing and searching web documents using Apache Solr recipes of this chapter.

Refer to http://www.hsqldb.org/doc/2.0/guide/index.html for more information on using HyperSQL.
