The number of links to a particular web page from other pages, the number of in-links, is widely considered a good metric for measuring the popularity or importance of a web page. In fact, the number of in-links to a web page and the importance of the sources of those links have become integral components of most popular link analysis algorithms, such as PageRank, introduced by Google.
In this recipe, we are going to extract the in-links information from a set of web pages fetched by Apache Nutch and stored in the Apache HBase backend data store. In our MapReduce program, we first retrieve the out-links information for the set of web pages stored in the Nutch HBase data store and then use that information to calculate the in-links graph for this set of web pages. The calculated in-links graph will contain only the link information from the fetched subset of the web graph.
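The core computation of this recipe, inverting an out-links graph into an in-links graph, can be illustrated with plain Java collections, independent of Nutch, HBase, and MapReduce. The following is a minimal sketch; the class name, method name, and page names are made up for illustration and do not appear in the recipe's source code:

```java
import java.util.*;

public class InLinkSketch {
    // Invert an out-links adjacency map (page -> pages it links to)
    // into an in-links map (page -> pages that link to it).
    public static Map<String, Set<String>> invert(Map<String, Set<String>> outLinks) {
        Map<String, Set<String>> inLinks = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : outLinks.entrySet()) {
            for (String target : e.getValue()) {
                inLinks.computeIfAbsent(target, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return inLinks;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> outLinks = new HashMap<>();
        outLinks.put("a.example", new TreeSet<>(Arrays.asList("b.example", "c.example")));
        outLinks.put("b.example", new TreeSet<>(Collections.singletonList("c.example")));
        // c.example is linked to by both a.example and b.example
        System.out.println(invert(outLinks).get("c.example")); // prints [a.example, b.example]
    }
}
```

The MapReduce program in this recipe performs exactly this inversion, with the outer loop distributed across Map tasks and the per-target grouping handled by the framework's shuffle phase.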
Follow the Whole web crawling with Apache Nutch using an existing Hadoop/HBase cluster recipe or the Configuring Apache HBase local mode as the backend data store for Apache Nutch recipe, and crawl a set of web pages using Apache Nutch into the backend HBase data store.
This recipe requires Apache Ant for building the source code.
The following steps show you how to extract an out-links graph from the web pages stored in the Nutch HBase data store and how to calculate the in-links graph using that extracted out-links graph.
1. Go to $HBASE_HOME and start the HBase shell:
   > bin/hbase shell
2. Create an HBase table named linkdata with a column family named il. Exit the HBase shell.
   hbase(main):002:0> create 'linkdata','il'
   0 row(s) in 1.8360 seconds
   hbase(main):002:0> quit
3. Compile the sample source code by running ant build from the Chapter 7 source directory. Copy the c7-samples.jar file to $HADOOP_HOME. Copy the $HBASE_HOME/hbase-*.jar and $HBASE_HOME/lib/zookeeper-*.jar files to $HADOOP_HOME/lib.
4. Run the MapReduce job by executing the following command from $HADOOP_HOME:
   > bin/hadoop jar c7-samples.jar chapter7.InLinkGraphExtractor
5. Start the HBase shell and scan the linkdata table using the following command to check the output of the MapReduce program:
   > bin/hbase shell
   hbase(main):005:0> scan 'linkdata',{COLUMNS=>'il',LIMIT=>10}
   ROW COLUMN+CELL
   ....
As we are going to use HBase to read the input as well as to write the output, we use the HBase TableMapper and TableReducer helper classes to implement our MapReduce application. We configure the TableMapper and the TableReducer using the utility methods provided by the TableMapReduceUtil class. The Scan object specifies the criteria the Mapper uses when reading the input data from the HBase data store.
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "InLinkGraphExtractor");
job.setJarByClass(InLinkGraphExtractor.class);
Scan scan = new Scan();
scan.addFamily("ol".getBytes());
TableMapReduceUtil.initTableMapperJob("webpage", scan, ……);
TableMapReduceUtil.initTableReducerJob("linkdata", ……);
The map implementation receives the HBase rows as its input records. In our implementation, each row corresponds to a fetched web page. The input key to the Map function is the web page's URL, and the value contains the web pages linked from this particular web page. The Map function emits one record for each linked web page, where the key of the Map output record is the URL of the linked page and the value is the input key to the Map function (the URL of the web page currently being processed).
public void map(ImmutableBytesWritable sourceWebPage, Result values, ……) {
  List<KeyValue> results = values.list();
  for (KeyValue keyValue : results) {
    ImmutableBytesWritable outLink =
        new ImmutableBytesWritable(keyValue.getQualifier());
    try {
      context.write(outLink, sourceWebPage);
    } catch (InterruptedException e) {
      throw new IOException(e);
    }
  }
}
The reduce implementation receives a web page URL as the key and a list of the web pages that contain links to that web page as the values. The reduce function stores this data in an HBase table.
public void reduce(ImmutableBytesWritable key,
    Iterable<ImmutableBytesWritable> values, ……{
  Put put = new Put(key.get());
  for (ImmutableBytesWritable immutableBytesWritable : values) {
    put.add(Bytes.toBytes("il"), immutableBytesWritable.get(),
        Bytes.toBytes("link"));
  }
  context.write(key, put);
}
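To see how the Map output records arrive at the reducer already grouped by linked page, the shuffle phase between the two functions above can be mimicked with plain Java. This is a sketch only: plain strings stand in for the ImmutableBytesWritable keys and values, and the class and method names are hypothetical:

```java
import java.util.*;

public class ShuffleSketch {
    // Group (linkedPage, sourcePage) pairs emitted by the mapper by their key,
    // mimicking what the MapReduce shuffle phase does before reduce is called.
    public static Map<String, List<String>> group(String[][] emitted) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : emitted) {
            // pair[0] is the source page, pair[1] is the page it links to;
            // the mapper emits the record keyed by the linked-to page.
            grouped.computeIfAbsent(pair[1], k -> new ArrayList<>()).add(pair[0]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Mapper emissions for a tiny web graph: a links to b and c, b links to c.
        String[][] emitted = { {"a", "b"}, {"a", "c"}, {"b", "c"} };
        for (Map.Entry<String, List<String>> e : group(emitted).entrySet()) {
            // Each entry is one reduce invocation; the real reducer writes
            // one HBase Put per key instead of printing.
            System.out.println(e.getKey() + " <- " + e.getValue());
        }
    }
}
```

Each grouped entry corresponds to one call of the reduce function above, which turns the list of source pages into the columns of a single row in the linkdata table.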