Network centrality and PageRank

Previously, we used the degree distribution and clustering coefficients of a network to understand how connected it is. In particular, we learned how to find the largest connected components and the nodes with the highest degrees. We then visualized the networks and saw which nodes are likely to play the role of hubs, since many other nodes connect to them. In this sense, the degree of a node can be interpreted as a centrality measure that determines how important that node is relative to the rest of the network. In this section, we will introduce a different centrality measure, as well as the PageRank algorithm, which is useful for ranking nodes in graphs.

Note

There exist many other measures of centrality for graphs. For example, betweenness centrality is useful for information flow problems. The betweenness of a node is the number of shortest paths from all vertices to all others that pass through that node. In contrast to PageRank, betweenness centrality assigns a high score to nodes that are strategically positioned on the shortest paths connecting pairs of other nodes, as written out in the formula below. Other measures include closeness centrality and Katz centrality. There are no predefined algorithms in GraphX to compute these measures. One reason is the high computational cost of computing betweenness centrality exactly. Therefore, approximate algorithms still need to be developed, and they would make an excellent open source contribution for extending the current GraphX library.
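
In the standard notation, with \sigma_{st} denoting the number of shortest paths from vertex s to vertex t, and \sigma_{st}(v) denoting the number of those paths that pass through v, the betweenness centrality of a vertex v is defined as:

C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}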

How PageRank works

PageRank is the famous algorithm behind Google's incredibly successful web search engine. In response to each search query, Google wants to display the most important web pages first. In brief, PageRank assigns a probability score to each page; the higher a page's score, the more likely a user is to land on that page in the long run.

To compute the final PageRank scores, the algorithm simulates the behavior of a random surfer by walking her through the web graph. At each step, the surfer either visits a page that the current page links to or jumps to a random page (not necessarily a neighboring one). These moves are made according to the transition probabilities that are specified by the structure of the graph. For example, a web graph with one thousand nodes is associated with a 1000 by 1000 transition probability matrix. The element in row i and column j of that matrix is 1/k if page j has k outgoing links and one of them points to page i; otherwise, it is zero. The PageRank algorithm starts with an initial score for every page and, at each step, updates all the PageRank scores.
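
To make this concrete, the following sketch builds the transition probability matrix for a small, hypothetical web graph of three pages (the pages and links here are invented purely for illustration):

// Hypothetical graph: page 0 links to pages 1 and 2,
// page 1 links to page 2, and page 2 links to page 0
val outLinks = Map(0 -> Seq(1, 2), 1 -> Seq(2), 2 -> Seq(0))
val n = 3

// Element (i, j) is 1/k if page j has k outgoing links and one of them
// points to page i; otherwise, it is zero
val P = Array.tabulate(n, n) { (i, j) =>
  val links = outLinks.getOrElse(j, Seq.empty)
  if (links.contains(i)) 1.0 / links.size else 0.0
}
// Each column of P sums to one: column j holds the probabilities of
// moving from page j to every other page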

A sketch implementation of this algorithm is shown as follows:

var PR = Array.fill(n)(1.0)
var oldPR = Array.fill(n)(1.0)
var iter = 0
var delta = Double.PositiveInfinity
while (iter < maxIter && delta > tol) {
  val tmp = oldPR; oldPR = PR; PR = tmp  // swap the old and new score arrays
  for (i <- 0 until n) {
    PR(i) = alpha + (1 - alpha) * inNbrs(i).map(j => oldPR(j) / outDeg(j)).sum
  }
  delta = (0 until n).map(i => math.abs(PR(i) - oldPR(i))).max
  iter += 1
}

In the preceding code, alpha is the random reset probability with a default value of 0.15, inNbrs(i) is the set of neighbors that link to page i, and outDeg(j) is the out-degree of the vertex j.

The first term in the update is due to the fact that the surfer can choose to skip the neighbors and instead jump to a random page with probability alpha. The second term updates the importance score of each page, based on the previous scores of the neighbors that link to that page. This process is repeated until the PageRank scores converge to a fixed value, or until a maximum number of iterations is reached.

In GraphX, there are two implementations of the PageRank algorithm. The first implementation uses the Pregel interface and runs PageRank for a fixed number of iterations, numIter. The second one uses the standalone Graph interface and runs PageRank until the change in the PageRank scores is smaller than a specified error tolerance, tol.
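
For example, assuming that graph is a Graph object that has already been loaded, the two variants can be invoked as follows (the number of iterations and the tolerance here are arbitrary example values; in both methods, the optional resetProb parameter defaults to 0.15):

import org.apache.spark.graphx._

// Run PageRank for a fixed number of iterations (Pregel-based implementation)
val fixedRanks = graph.staticPageRank(numIter = 20).vertices

// Run PageRank until the scores change by less than the given tolerance
val convergedRanks = graph.pageRank(0.0001).vertices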

Note

These PageRank algorithms exploit data parallelism over the vertices. In particular, the Pregel implementation relies on local message passing for updating the PageRank scores. Another point to note is that the PageRank scores that are returned are not normalized. Thus, they do not represent a probability distribution. Moreover, pages that have no incoming links will have a PageRank score of alpha. Nonetheless, the top pages can still be found by sorting the vertices of the returned PageRank graph by their score attribute.
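
If a true probability distribution is needed, the scores can be normalized manually. The following is a minimal sketch, assuming that ranks holds the (vertex ID, score) pairs returned by pageRank, as in the example of the next section:

// Normalize the PageRank scores so that they sum to one
val total = ranks.map(_._2).sum()
val normalizedRanks = ranks.mapValues(_ / total)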

Ranking web pages

Here, we will use a new dataset to demonstrate PageRank: a web graph of pages from the University of Notre Dame, in which directed edges represent hyperlinks between pages. Using PageRank, we will rank the pages and find the most important ones.

The dataset, which was first used by Albert, Jeong, and Barabási (1999), can be downloaded from http://snap.stanford.edu/data/web-NotreDame.html:

// Load web graph
val webGraph = GraphLoader.edgeListFile(sc, "./data/web-NotreDame.txt")

// Run PageRank with an error tolerance of 0.001
val ranks = webGraph.pageRank(0.001).vertices

// Find the top 10 pages of University of Notre Dame 
val topPages = ranks.sortBy(_._2, false).take(10) 
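
Finally, the vertex IDs and scores of the top pages can be printed on the driver:

// Print the top 10 pages with their PageRank scores
topPages.foreach { case (id, rank) => println(s"Page $id has rank $rank") }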