How to do it...

  1. Import the GraphX-related classes:
        scala> import org.apache.spark.graphx._
  1. Load the edges from Amazon S3:
        scala> val edgesFile = sc.textFile("s3a://com.infoobjects.wiki/links", 20)
The links file has links in the sourcelink: link1 link2 ... format.
  1. Flatten edgesFile into an RDD of (source link, target link) pairs, and then convert it into an RDD of Edge objects:
        scala> val edges = edgesFile.flatMap { line =>
                 val links = line.split("\\W+")
                 val from = links(0)
                 val to = links.tail
                 for (link <- to) yield (from, link)
               }.map(e => Edge(e._1.toLong, e._2.toLong, 1))
Use :paste mode to enter multi-line code, and then press Ctrl+D to execute it.
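The line-splitting logic in the preceding step can be tried without a cluster. The following is a minimal sketch on plain Scala collections, assuming a few made-up lines in the sourcelink: link1 link2 ... format; `sampleLines` and `parseLine` are hypothetical names used only for illustration. Note that the backslash in the regular expression has to be doubled inside an ordinary Scala string literal.

```scala
// Hypothetical sample lines in the "sourcelink: link1 link2 ..." format.
val sampleLines = Seq("1: 2 3", "2: 3", "4:")

// Split on runs of non-word characters; the first token is the source,
// and the remaining tokens are its targets.
def parseLine(line: String): Seq[(Long, Long)] = {
  val ids = line.split("\\W+").filter(_.nonEmpty)
  ids.headOption match {
    case Some(from) => ids.tail.toSeq.map(to => (from.toLong, to.toLong))
    case None       => Seq.empty // blank line: no edges
  }
}

// One (source, target) pair per link; a source without links yields nothing.
val pairs = sampleLines.flatMap(parseLine)
pairs.foreach(println)
```

A line such as "4:" has a source but no targets, so it contributes no pairs; the same holds for the flatMap over the RDD.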
  1. Load the vertices from Amazon S3:
        scala> val verticesFile = sc.textFile("s3a://com.infoobjects.wiki/nodes", 20)
  1. Assign an index to each vertex, and then swap each pair into the (index, title) format:
        scala> val vertices = verticesFile.zipWithIndex.map(_.swap)
  1. Create the graph object:
        scala> val graph = Graph(vertices,edges)
  1. Run the pageRank function with a convergence tolerance of 0.001, and get the vertices:
        scala> val ranks = graph.pageRank(0.001).vertices
  1. Because ranks are in the (vertex ID, pagerank) format, swap each pair into the (pagerank, vertex ID) format:
        scala> val swappedRanks = ranks.map(_.swap)
  1. Sort to get the highest ranked pages first:
        scala> val sortedRanks = swappedRanks.sortByKey(false)
  1. Get the highest ranked page:
        scala> val highest = sortedRanks.first
  1. The preceding command gives the vertex ID, which you still have to look up to see the actual title with the rank. Let's do a join:
        scala> val join = ranks.join(vertices)
  1. Sort the joined RDD again after converting from the (vertex ID, (page rank, title)) format to the (page rank, (vertex ID, title)) format:
        scala> val result = join.map(v => (v._2._1, (v._1, v._2._2))).sortByKey(false)
  1. Print the top five ranked pages:
        scala> result.take(5).foreach(println)
  1. Here's what the output should be:
        (12406.054646736622,(5302153,United_States'_Country_Reports_on_Human_Rights_Practices))
        (7925.094429748747,(84707,2007,_Canada_budget))
        (7635.6564216408515,(88822,2008,_Madrid_plane_crash))
        (7041.479913258444,(1921890,Geographic_coordinates))
        (5675.169862343964,(5300058,United_Kingdom's))
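The join-and-sort steps of this recipe can also be checked on a small scale. The following is a minimal sketch that mirrors them on in-memory Scala collections rather than RDDs; `toyRanks`, `toyTitles`, `joined`, and `topPages` are hypothetical names, and the data is made up for illustration.

```scala
// Hypothetical toy data standing in for the ranks and vertices RDDs.
val toyRanks: Map[Long, Double] = Map(0L -> 0.5, 1L -> 2.0, 2L -> 1.25)
val toyTitles: Map[Long, String] = Map(0L -> "A", 1L -> "B", 2L -> "C")

// Inner join on vertex ID: (vertex ID, (page rank, title)).
val joined: Seq[(Long, (Double, String))] =
  toyRanks.toSeq.flatMap { case (id, rank) =>
    toyTitles.get(id).map(title => (id, (rank, title)))
  }

// Re-key by rank and sort in descending order: (page rank, (vertex ID, title)).
val topPages: Seq[(Double, (Long, String))] =
  joined
    .map { case (id, (rank, title)) => (rank, (id, title)) }
    .sortBy { case (rank, _) => -rank }

// Print the two highest-ranked toy pages.
topPages.take(2).foreach(println)
```

The join has to be keyed by vertex ID on both sides, which is why the re-keying by rank happens only after the titles have been attached.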