Intra-domain web crawling using Apache Nutch

Web crawling is the process of visiting and downloading all or a subset of the web pages on the Internet. Although the concept of crawling is simple, and implementing a basic crawler is relatively straightforward, building a full-fledged crawler takes a great deal of work. A full-fledged, distributed crawler has to follow best practices such as not overloading servers, obeying robots.txt, performing periodic crawls, prioritizing the pages to crawl, and identifying many document formats. Apache Nutch is an open source search engine that provides a highly scalable crawler, offering features such as politeness, robustness, and scalability.

In this recipe, we are going to use Apache Nutch in the standalone mode for small-scale, intra-domain web crawling. Almost all the Nutch commands are implemented as Hadoop MapReduce applications, as you will notice when executing steps 10 to 18 of this recipe. Nutch standalone executes these applications using the Hadoop local mode.

Getting ready

Set the JAVA_HOME environment variable. Install Apache Ant and add it to the PATH environment variable.
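
For example, on a Linux system the setup might look similar to the following. The install paths shown here are hypothetical; replace them with the actual locations of your JDK and Apache Ant installations:

    > export JAVA_HOME=/usr/lib/jvm/java-7-oracle
    > export ANT_HOME=/opt/apache-ant-1.9.4
    > export PATH=$ANT_HOME/bin:$PATH
    > ant -version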

How to do it...

The following steps show you how to use Apache Nutch in standalone mode for small-scale web crawling.

  1. Apache Nutch standalone mode uses the HyperSQL database as the default data storage. Download HyperSQL from http://sourceforge.net/projects/hsqldb/. Unzip the distribution and go to the hsqldb directory:
    > cd hsqldb-2.2.9/hsqldb
    
  2. Start a HyperSQL database using the following command. The database uses data/nutchdb.* as the database files and nutchdb as the database alias name. We will use this database alias name in the gora.sqlstore.jdbc.url property in step 8.
    > java -cp lib/hsqldb.jar org.hsqldb.server.Server --database.0 file:data/nutchdb --dbname.0 nutchdb
    ......
    [Server@79616c7]: Database [index=0, id=0, db=file:data/nutchdb, alias=nutchdb] opened successfully in 523 ms.
    ......
    
  3. Download Apache Nutch 2.X from http://nutch.apache.org/ and extract it.
  4. Go to the extracted directory, which we will refer to as NUTCH_HOME, and build Apache Nutch using the following command:
    > ant runtime
    
  5. Go to the runtime/local directory and run the bin/nutch command to verify the Nutch installation. A successful installation would print out the list of Nutch commands, shown as follows:
    > cd runtime/local
    > bin/nutch
    Usage: nutch COMMAND
    where COMMAND is one of:…..
    
  6. Add the following to the NUTCH_HOME/runtime/local/conf/nutch-site.xml file. You can use any name as the value of http.agent.name.
    <configuration>
    <property>
     <name>http.agent.name</name>
     <value>NutchCrawler</value>
    </property>
    <property>
      <name>http.robots.agents</name>
      <value>NutchCrawler,*</value>
    </property>
    </configuration>
  7. You can restrict the domain names you wish to crawl by editing the NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt file. For example, in order to restrict the crawl to the http://apache.org domain:

    Replace the following in the NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt file:

    # accept anything else
    +.

    Use the following regular expression:

    +^http://([a-z0-9]*\.)*apache.org/
  8. Ensure that you have the following in the NUTCH_HOME/runtime/local/conf/gora.properties file. Provide the database alias name used in step 2.
    ###############################
    # Default SqlStore properties #
    ###############################
    gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
    gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchdb
    gora.sqlstore.jdbc.user=sa
  9. Create a directory named urls and create a file named seed.txt inside that directory. Add your seed URLs to this file. Seed URLs are used to start the crawl and are the first pages to be fetched. We use http://apache.org as the seed URL in the following example:
    > mkdir urls
    > echo http://apache.org/ > urls/seed.txt
    
  10. Inject the seed URLs into the Nutch database using the following command:
    > bin/nutch inject urls/
    InjectorJob: starting
    InjectorJob: urlDir: urls
    InjectorJob: finished
    
  11. Use the following command to verify the injection of the seeds into the Nutch database. The TOTAL urls value printed by this command should match the number of URLs in your seed.txt file. You can use this command in later cycles as well to get an idea of the number of web page entries in your database.
    > bin/nutch readdb  -stats
    WebTable statistics start
    Statistics for WebTable:
    min score:  1.0
    ....
    TOTAL urls:  1
    
  12. Use the following command to generate a fetch list from the injected seed URLs. This prepares a list of web pages to be fetched in the first cycle of the crawl. Generation assigns a batch ID to the generated fetch list, which can be used in subsequent commands.
    > bin/nutch generate
    GeneratorJob: Selecting best-scoring urls due for fetch.
    GeneratorJob: starting
    GeneratorJob: filtering: true
    GeneratorJob: done
    GeneratorJob: generated batch id: 1350617353-1356796157
    
  13. Use the following command to fetch the list of pages prepared in step 12. This step performs the actual fetching of the web pages. The -all parameter instructs Nutch to fetch all the generated batches.
    > bin/nutch fetch -all
    FetcherJob: starting
    FetcherJob: fetching all
    FetcherJob: threads: 10
    ......
    fetching http://apache.org/
    ......
    -activeThreads=0
    FetcherJob: done
    
  14. Use the following command to parse the fetched web pages and to extract the useful data from them, such as the text content of the pages, the metadata of the pages, and the set of pages linked from the fetched pages. The set of pages linked from a fetched page is called the out-links of that page. The out-links data will be used to discover new pages to fetch as well as to rank pages using link analysis algorithms such as PageRank.
    > bin/nutch parse -all
    ParserJob: starting
    ......
    ParserJob: success
    
  15. Execute the following command to update the Nutch database with the data extracted in the preceding step. This step updates the contents of the fetched pages and adds new entries for the pages discovered through the links contained in the fetched pages.
    > bin/nutch updatedb
    DbUpdaterJob: starting
    ......
    DbUpdaterJob: done
    
  16. Execute the following command to generate a new fetch list using the information from the previously fetched data. The -topN parameter limits the number of URLs generated for the next fetch cycle.
    > bin/nutch generate -topN 100
    GeneratorJob: Selecting best-scoring urls due for fetch.
    GeneratorJob: starting
    ......
    GeneratorJob: done
    GeneratorJob: generated batch id: 1350618261-1660124671
    
  17. Fetch the new list, parse it, and update the database.
    > bin/nutch fetch -all
    ......
    > bin/nutch parse -all
    ......
    > bin/nutch updatedb
    ......
    
  18. Repeat steps 16 and 17 until you have crawled the desired number of pages or reached the desired link depth. A simple shell loop that automates these crawl cycles is sketched after this list.
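
The following is a minimal Bash sketch of such an automated crawl loop, run from the NUTCH_HOME/runtime/local directory. The iteration count (three cycles) and the -topN limit of 100 are arbitrary choices for illustration; adjust them to suit the size of the crawl you need.

    #!/bin/bash
    # Run repeated generate-fetch-parse-update crawl cycles.
    # Assumes this script is executed from NUTCH_HOME/runtime/local.
    for i in 1 2 3; do
      bin/nutch generate -topN 100   # prepare the next fetch list
      bin/nutch fetch -all           # fetch all generated batches
      bin/nutch parse -all           # parse the fetched pages
      bin/nutch updatedb             # update the Nutch database
    done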

See also

The Whole web crawling with Apache Nutch using a Hadoop/HBase cluster and the Indexing and searching web documents using Apache Solr recipes of this chapter.

Refer to http://www.hsqldb.org/doc/2.0/guide/index.html for more information on using HyperSQL.
