Generating the in-links graph for crawled web pages

The number of links pointing to a particular web page from other pages, known as the number of in-links, is widely considered a good metric for measuring the popularity or importance of a web page. In fact, the number of in-links to a web page and the importance of the sources of those links are integral components of popular link analysis algorithms such as PageRank, introduced by Google.

In this recipe, we are going to extract the in-links information for a set of web pages fetched by Apache Nutch and stored in the Apache HBase backend data store. Our MapReduce program first retrieves the out-links information for the web pages stored in the Nutch HBase data store, and then inverts that information to calculate the in-links graph for this set of web pages. For example, if page A links to pages B and C, the mapper emits the records (B, A) and (C, A), so that the reducer for key B receives the list of all pages that link to B. Note that the calculated in-links graph contains link information only for the fetched subset of the web graph.

Getting ready

Follow either the Whole web crawling with Apache Nutch using an existing Hadoop/HBase cluster recipe or the Configuring Apache HBase local mode as the backend data store for Apache Nutch recipe, and crawl a set of web pages using Apache Nutch into the backend HBase data store.

This recipe requires Apache Ant for building the source code.

How to do it

The following steps show you how to extract the out-links graph from the web pages stored in the Nutch HBase data store and how to calculate the in-links graph from that extracted out-links graph.

  1. Go to $HBASE_HOME and start the HBase shell.
    > bin/hbase shell
    
  2. Create an HBase table with the name linkdata and a column family named il, and then exit the HBase shell. (A programmatic alternative that creates the same table through the HBase Java client API is sketched after this list.)
    hbase(main):002:0> create 'linkdata','il'
    0 row(s) in 1.8360 seconds
    hbase(main):002:0> quit
    
  3. Unzip the source package for this chapter and compile it by executing ant build from the Chapter 7 source directory.
  4. Copy the c7-samples.jar file to $HADOOP_HOME. Copy the $HBASE_HOME/hbase-*.jar and $HBASE_HOME/lib/zookeeper-*.jar to $HADOOP_HOME/lib.
  5. Run the Hadoop program by issuing the following command from $HADOOP_HOME.
    > bin/hadoop jar c7-samples.jar chapter7.InLinkGraphExtractor
    
  6. Start the HBase shell and scan the linkdata table using the following command to check the output of the MapReduce program:
    > bin/hbase shell
    hbase(main):005:0> scan 'linkdata',{COLUMNS=>'il',LIMIT=>10}
    ROW                            COLUMN+CELL                    
    ....
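
If you prefer to create the linkdata table from Java rather than from the HBase shell (the alternative mentioned in step 2), the following sketch uses the HBase client API of this era (HBaseAdmin, HTableDescriptor, and HColumnDescriptor). It is an optional, illustrative helper and is not part of the c7-samples source.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateLinkDataTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // Create the 'linkdata' table with a single column family 'il'
    // to hold the in-links, matching the shell commands in step 2.
    if (!admin.tableExists("linkdata")) {
      HTableDescriptor tableDescriptor = new HTableDescriptor("linkdata");
      tableDescriptor.addFamily(new HColumnDescriptor("il"));
      admin.createTable(tableDescriptor);
    }
  }
}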
    

How it works

As we are going to use HBase both to read the input and to write the output, we implement our MapReduce application using the HBase TableMapper and TableReducer helper classes, and we configure them using the utility methods provided by the TableMapReduceUtil class. The Scan object specifies the criteria the mapper uses when reading the input data from the HBase data store; here, the scan is restricted to the ol (out-links) column family of Nutch's webpage table.

Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "InLinkGraphExtractor");
job.setJarByClass(InLinkGraphExtractor.class);
Scan scan = new Scan();
scan.addFamily("ol".getBytes());
TableMapReduceUtil.initTableMapperJob("webpage", scan, ……);
TableMapReduceUtil.initTableReducerJob("linkdata",……);
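
The elided arguments in the above calls are the mapper and reducer classes and the map output key/value types. As a point of reference, the complete driver code might look like the following sketch; the class names InLinkMapper and InLinkReducer are assumed here for illustration and may differ from the names used in the c7-samples source.

// Assumed imports: org.apache.hadoop.conf.Configuration,
// org.apache.hadoop.hbase.HBaseConfiguration, org.apache.hadoop.hbase.client.Scan,
// org.apache.hadoop.hbase.io.ImmutableBytesWritable,
// org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil, org.apache.hadoop.mapreduce.Job

Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "InLinkGraphExtractor");
job.setJarByClass(InLinkGraphExtractor.class);

// Limit the scan to the out-links ("ol") column family of Nutch's "webpage" table.
Scan scan = new Scan();
scan.addFamily("ol".getBytes());

// Mapper side: input table, scan criteria, mapper class, and the map output
// key/value types (both URLs carried as ImmutableBytesWritable).
TableMapReduceUtil.initTableMapperJob("webpage", scan, InLinkMapper.class,
    ImmutableBytesWritable.class, ImmutableBytesWritable.class, job);

// Reducer side: output table and reducer class.
TableMapReduceUtil.initTableReducerJob("linkdata", InLinkReducer.class, job);

job.waitForCompletion(true);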

The map implementation receives the HBase rows as its input records; in our implementation, each row corresponds to a fetched web page. The input key to the Map function is the web page's URL and the input value contains the web pages linked from that page. The Map function emits a record for each linked web page, where the key of the Map output record is the URL of the linked page and the value is the input key to the Map function (the URL of the web page currently being processed).

public void map(ImmutableBytesWritable sourceWebPage, Result values, ……){
  List<KeyValue> results = values.list();
  for (KeyValue keyValue : results) {
    ImmutableBytesWritable outLink =
        new ImmutableBytesWritable(keyValue.getQualifier());
    try {
      context.write(outLink, sourceWebPage);
    } catch (InterruptedException e) {
      throw new IOException(e);
    }
  }
}
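
For context, this map method would sit inside a TableMapper subclass roughly as sketched below. The class name InLinkMapper is assumed for illustration; the generic parameters of TableMapper are the map output key and value types, while the input types are fixed to ImmutableBytesWritable (the row key) and Result (the row contents).

public static class InLinkMapper
    extends TableMapper<ImmutableBytesWritable, ImmutableBytesWritable> {

  @Override
  public void map(ImmutableBytesWritable sourceWebPage, Result values,
      Context context) throws IOException {
    // Each KeyValue in the scanned "ol" family represents one out-link;
    // the column qualifier holds the URL of the linked (target) page.
    for (KeyValue keyValue : values.list()) {
      ImmutableBytesWritable outLink =
          new ImmutableBytesWritable(keyValue.getQualifier());
      try {
        // Invert the edge: key = linked page, value = linking (source) page.
        context.write(outLink, sourceWebPage);
      } catch (InterruptedException e) {
        throw new IOException(e);
      }
    }
  }
}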

The reduce implementation receives a web page URL as the key and the list of web pages that contain links to that web page as the values. The reduce function stores this data in the linkdata HBase table, writing each in-link as a separate column in the il column family.

public void reduce(ImmutableBytesWritable key,
    Iterable<ImmutableBytesWritable> values, ……{
  Put put = new Put(key.get());
  for (ImmutableBytesWritable immutableBytesWritable : values) {
    put.add(Bytes.toBytes("il"), immutableBytesWritable.get(),
        Bytes.toBytes("link"));
  }
  context.write(key, put);
}
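
Similarly, the reduce method belongs inside a TableReducer subclass, sketched below with the assumed class name InLinkReducer. TableReducer's third type parameter is the type of the key passed to context.write(); the written value is the Put mutation, which the TableOutputFormat configured by initTableReducerJob() applies to the linkdata table.

public static class InLinkReducer
    extends TableReducer<ImmutableBytesWritable, ImmutableBytesWritable,
        ImmutableBytesWritable> {

  @Override
  public void reduce(ImmutableBytesWritable key,
      Iterable<ImmutableBytesWritable> values, Context context)
      throws IOException, InterruptedException {
    // The row key of the output record is the URL of the linked-to page.
    Put put = new Put(key.get());
    for (ImmutableBytesWritable inLink : values) {
      // One column per in-link: family "il", qualifier = source page URL,
      // with a placeholder cell value.
      put.add(Bytes.toBytes("il"), inLink.get(), Bytes.toBytes("link"));
    }
    context.write(key, put);
  }
}

The rows written here are what you see when scanning the linkdata table's il column family in step 6.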

See also

  • The Running MapReduce jobs on HBase (table input/output) recipe of Chapter 5, Hadoop Ecosystem.