Web crawling is the process of visiting and downloading all or a subset of the web pages on the Internet. Although the concept of crawling sounds simple, building a full-fledged crawler takes a great deal of work. A full-fledged, distributed crawler has to follow best practices such as not overloading servers, obeying robots.txt, performing periodic crawls, prioritizing the pages to crawl, and identifying the many formats of documents. Apache Nutch is an open source search engine that provides a highly scalable crawler, offering features such as politeness, robustness, and scalability.
In this recipe, we are going to use Apache Nutch in the standalone mode for small-scale, intra-domain web crawling. Almost all the Nutch commands are implemented as Hadoop MapReduce applications, as you will notice when executing steps 10 to 18 of this recipe. Nutch standalone executes these applications using the Hadoop local mode.
Set the JAVA_HOME environment variable. Install Apache Ant and add it to the PATH environment variable.
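For a Linux bash shell, a minimal sketch of these settings looks like the following; the JDK and Ant paths shown are assumptions that you should replace with the actual install locations on your machine:
> export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
> export ANT_HOME=/opt/apache-ant
> export PATH=$PATH:$ANT_HOME/bin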
The following steps show you how to use Apache Nutch in standalone mode for small-scale web crawling.
Download and extract HSQLDB, and go to the extracted directory:
> cd hsqldb-2.2.9/hsqldb
Start a HSQLDB server using the following command. This server uses data/nutchdb.* as the database files and uses nutchdb as the database alias name. We'll be using this database alias name in the gora.sqlstore.jdbc.url property in step 7.
> java -cp lib/hsqldb.jar org.hsqldb.server.Server --database.0 file:data/nutchdb --dbname.0 nutchdb
......
[Server@79616c7]: Database [index=0, id=0, db=file:data/nutchdb, alias=nutchdb] opened sucessfully in 523 ms.
......
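If you want to confirm that the database is reachable before moving on, one option is the SqlTool client bundled with HSQLDB. The following invocation is a sketch that assumes the sqltool.jar shipped in the same lib directory and the default sa user with a blank password:
> java -jar lib/sqltool.jar --inlineRc=url=jdbc:hsqldb:hsql://localhost/nutchdb,user=sa,password=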
Download Apache Nutch and extract it. We will refer to the extracted directory as NUTCH_HOME. Go to NUTCH_HOME, and build Apache Nutch using the following command:
> ant runtime
Go to the runtime/local directory and run the bin/nutch command to verify the Nutch installation. A successful installation would print out the list of Nutch commands, shown as follows:
> cd runtime/local
> bin/nutch
Usage: nutch COMMAND
where COMMAND is one of:
.....
Add the following to the NUTCH_HOME/runtime/local/conf/nutch-site.xml file. You can give any name to the value of http.agent.name.
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>NutchCrawler</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>NutchCrawler,*</value>
  </property>
</configuration>
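Since a malformed nutch-site.xml will break the Nutch commands that follow, it is worth checking that the file is still well-formed after editing. One way is the xmllint tool, which is assumed to be available on your machine (it ships with libxml2); run it from the runtime/local directory:
> xmllint --noout conf/nutch-site.xml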
You can restrict the URLs to be crawled by editing the NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt file. For example, in order to restrict the crawl to the http://apache.org domain, replace the following in the regex-urlfilter.txt file:
# accept anything else
+.
with the following regular expression:
+^http://([a-z0-9]*\.)*apache.org/
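The same mechanism extends to crawls spanning several domains: each permitted domain gets its own accept line, and the first matching rule in the file decides whether a URL is accepted or rejected. In the following sketch, example.org is a hypothetical placeholder for a second domain:
+^http://([a-z0-9]*\.)*apache.org/
+^http://([a-z0-9]*\.)*example.org/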
Configure the following in the NUTCH_HOME/runtime/local/conf/gora.properties file. Provide the database alias name used in step 2.
###############################
# Default SqlStore properties #
###############################
gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchdb
gora.sqlstore.jdbc.user=sa
Create a directory named urls and create a file named seed.txt inside that directory. Add your seed URLs to this file. Seed URLs are used to start the crawling and would be the pages that are crawled first. We use http://apache.org as the seed URL in the following example:
> mkdir urls
> echo http://apache.org/ > urls/seed.txt
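If you want the crawl to start from more than one page, each additional seed URL goes on its own line in seed.txt. The second URL below is only an illustration, chosen so that it still falls within the apache.org domain restriction from the previous step:
> echo http://nutch.apache.org/ >> urls/seed.txt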
Inject the seed URLs into the Nutch database using the following command:
> bin/nutch inject urls/
InjectorJob: starting
InjectorJob: urlDir: urls
InjectorJob: finished
Use the following command to verify the injection of the seed URLs. The number of TOTAL urls printed by this command should match the number of URLs you had in your seed.txt file. You can use this command in later cycles as well to get an idea about the number of web page entries in your database.
> bin/nutch readdb -stats
WebTable statistics start
Statistics for WebTable:
min score: 1.0
....
TOTAL urls: 1
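Besides the aggregate statistics, the readdb command can also inspect a single entry. The Nutch 2.x WebTableReader accepts a -url parameter for this; treat the exact option as an assumption to verify against the usage message printed by bin/nutch readdb:
> bin/nutch readdb -url http://apache.org/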
Generate a fetch list from the injected seed URLs using the following command:
> bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1350617353-1356796157
Fetch the generated batch using the following command. The -all parameter is used to inform Nutch to fetch all the generated batches.
> bin/nutch fetch -all
FetcherJob: starting
FetcherJob: fetching all
FetcherJob: threads: 10
......
fetching http://apache.org/
......
-activeThreads=0
FetcherJob: done
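The thread count of 10 shown in the output is the default. For larger crawls you can raise it; the fetch command takes a -threads parameter for this (verify the exact flag against the usage message printed by bin/nutch fetch):
> bin/nutch fetch -all -threads 20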
Parse the fetched pages using the following command:
> bin/nutch parse -all
ParserJob: starting
......
ParserJob: success
Update the Nutch database with the parsed data using the following command:
> bin/nutch updatedb
DbUpdaterJob: starting
......
DbUpdaterJob: done
Generate a new fetch list, limited to the 100 top-scoring URLs, using the following command:
> bin/nutch generate -topN 100
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
......
GeneratorJob: done
GeneratorJob: generated batch id: 1350618261-1660124671
Repeat the fetch, parse, and updatedb steps for the newly generated batch:
> bin/nutch fetch -all
......
> bin/nutch parse -all
......
> bin/nutch updatedb
......
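Since the generate, fetch, parse, and updatedb commands form a fixed cycle, they lend themselves to a simple shell loop. The following bash sketch runs three further crawl rounds from the runtime/local directory; the iteration count and the -topN value are arbitrary assumptions to tune for your crawl:
# run three additional crawl cycles
for i in 1 2 3; do
  bin/nutch generate -topN 100
  bin/nutch fetch -all
  bin/nutch parse -all
  bin/nutch updatedb
done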
The Whole web crawling with Apache Nutch using a Hadoop/HBase cluster and the Indexing and searching web documents using Apache Solr recipes of this chapter.
Refer to http://www.hsqldb.org/doc/2.0/guide/index.html for more information on using HyperSQL.