- Import the GraphX-related classes:
scala> import org.apache.spark.graphx._
- Load the edges from Amazon S3:
scala> val edgesFile = sc.textFile(
"s3a://com.infoobjects.wiki/links",20)
The links file has links in the sourcelink: link1 link2 ... format.
- Flatten edgesFile into an RDD of (source link, target link) pairs, and then convert it into an RDD of Edge objects:
scala> val edges = edgesFile.flatMap { line =>
val links = line.split("\\W+")
val from = links(0)
val to = links.tail
for ( link <- to) yield (from,link)
}.map( e => Edge(e._1.toLong,e._2.toLong,1))
Use :paste mode to enter multi-line code, and then execute it using Ctrl+D.
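To see what the flatMap step produces, the parsing logic can be tried on a single line with plain Scala collections, without touching the cluster. This is only a minimal sketch: the sample line and the helper name parseLine are assumptions made for illustration, and the page identifiers are assumed to be numeric, as the toLong calls imply.

```scala
// Hypothetical helper mirroring the body of the flatMap step.
def parseLine(line: String): Seq[(Long, Long)] = {
  // Split on runs of non-word characters, so "1: 2 3" becomes Array("1", "2", "3").
  val links = line.split("\\W+")
  val from = links.head
  // Pair the source link with every target link on the line.
  links.tail.toSeq.map(to => (from.toLong, to.toLong))
}

// Assumed sample line in the "sourcelink: link1 link2 ..." format.
val pairs = parseLine("1: 2 3 4")
pairs.foreach(println)
```

Each pair corresponds to one Edge(from, to, 1) in the real pipeline; the constant 1 is just a placeholder edge attribute.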
- Load the vertices from Amazon S3:
scala> val verticesFile = sc.textFile(
"s3a://com.infoobjects.wiki/nodes",20)
- Assign an index to each vertex, and then swap each pair into the (index, title) format:
scala> val vertices = verticesFile.zipWithIndex.map(_.swap)
- Create the graph object:
scala> val graph = Graph(vertices,edges)
- Run the pageRank function with a convergence tolerance of 0.001, and get the vertices with their ranks:
scala> val ranks = graph.pageRank(0.001).vertices
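The argument 0.001 is a tolerance: iteration stops once ranks change by less than that amount. The following plain-Scala sketch illustrates the idea on a tiny in-memory graph. It is not GraphX's implementation, which runs distributed on top of Pregel; the toy graph and all names here are assumptions. The update rule (reset probability plus a damped sum of incoming contributions, without normalizing by the number of vertices) follows the formula described in the GraphX documentation, which is also why real ranks can be far larger than 1.

```scala
// Assumed toy graph as (source, target) pairs; not taken from the wiki data.
val toyLinks = Seq((1L, 2L), (1L, 3L), (2L, 3L), (3L, 1L))
val toyNodes = toyLinks.flatMap { case (s, d) => Seq(s, d) }.distinct
val outDegree = toyLinks.groupBy(_._1).map { case (s, es) => s -> es.size }

val resetProb = 0.15 // default value of pageRank's second parameter in GraphX
val tol = 0.001      // same tolerance as in the recipe

def pageRankSketch(): Map[Long, Double] = {
  var current = toyNodes.map(_ -> 1.0).toMap
  var delta = Double.MaxValue
  while (delta > tol) {
    // Each page receives rank / outDegree from every page linking to it.
    val contribs = toyLinks
      .map { case (s, d) => d -> current(s) / outDegree(s) }
      .groupBy(_._1)
      .map { case (d, cs) => d -> cs.map(_._2).sum }
    val updated = toyNodes.map { n =>
      n -> (resetProb + (1 - resetProb) * contribs.getOrElse(n, 0.0))
    }.toMap
    // Stop when no rank moved by more than the tolerance.
    delta = toyNodes.map(n => math.abs(updated(n) - current(n))).max
    current = updated
  }
  current
}

val toyRanks = pageRankSketch()
toyRanks.toSeq.sortBy(-_._2).foreach(println)
```

A smaller tolerance gives more precise ranks at the cost of more iterations, which matters on a graph the size of the wiki links data.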
- As ranks are in the (vertex ID, pagerank) format, swap them into the (pagerank, vertex ID) format:
scala> val swappedRanks = ranks.map(_.swap)
- Sort to get the highest ranked pages first:
scala> val sortedRanks = swappedRanks.sortByKey(false)
- Get the highest ranked page:
scala> val highest = sortedRanks.first
- The preceding command gives the vertex ID, which you still have to look up to see the actual title with the rank. Let's do a join:
scala> val join = ranks.join(vertices)
- Sort the joined RDD again after converting from the (vertex ID, (page rank, title)) format to the (page rank, (vertex ID, title)) format:
scala> val result = join.map ( v => (v._2._1,
(v._1,v._2._2))).sortByKey(false)
- Print the top five ranked pages:
scala> result.take(5).foreach(println)
- Here's what the output should be:
(12406.054646736622,(5302153,United_States'_Country_Reports_on_Human_Rights_Practices))
(7925.094429748747,(84707,2007,_Canada_budget))
(7635.6564216408515,(88822,2008,_Madrid_plane_crash))
(7041.479913258444,(1921890,Geographic_coordinates))
(5675.169862343964,(5300058,United_Kingdom's))
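The join-then-sort reshaping in the last steps can be checked on plain Scala collections before running it on the full data set. This is a sketch under stated assumptions: the sample ranks and titles are made up, and a Map lookup stands in for the RDD join.

```scala
// Assumed toy data standing in for the ranks and vertices RDDs.
val sampleRanks = Seq((1L, 0.9), (2L, 2.4), (3L, 1.7))                 // (vertex ID, pagerank)
val sampleTitles = Map(1L -> "Page_A", 2L -> "Page_B", 3L -> "Page_C") // vertex ID -> title

// Join on vertex ID, then reshape to (pagerank, (vertex ID, title)).
val joined = for {
  (id, rank) <- sampleRanks
  title <- sampleTitles.get(id)
} yield (rank, (id, title))

// Sort by rank, highest first, and keep the top two.
val top = joined.sortBy(-_._1).take(2)
top.foreach(println)
```

The printed tuples have the same (pagerank, (vertex ID, title)) shape as the output above.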