Chapter 5. Built-in algorithms

This chapter covers

  • Algorithms that come with the GraphX API
  • Seeking out authoritative nodes with PageRank and counting hops with Shortest Paths
  • Detecting clusters within graphs with Connected Components and Label Propagation
  • Measuring connectedness of a graph or subgraph with Triangle Count
  • Measuring the connectedness of a subset of users in a social network graph and finding isolated populations

In chapter 4 you learned about the foundational GraphX APIs that enable you to write your own custom algorithms. But there’s no need for you to reinvent the wheel in cases where the GraphX API already provides an implemented standard algorithm. In this chapter, we describe some of those basic algorithms and discuss which situations they can be used in:

  • PageRank
  • Personalized PageRank
  • Triangle Count
  • Shortest Paths
  • Connected Components
  • Strongly Connected Components
  • Label Propagation

We wait until chapter 7 to cover SVDPlusPlus, one of the more useful and advanced built-in algorithms.

5.1. Seek out authoritative nodes: PageRank

Chapter 2 covered an example of invoking the PageRank algorithm. PageRank was originally invented to rank web pages for a search engine, but there we used it to find influential papers in a citation network. Generally speaking, PageRank can be used to find the “important” nodes in almost any graph. Here we go more in depth into PageRank: what it does under the covers, what its parameters mean, and the different ways of invoking it. Note that PageRank is patented by Stanford University and trademarked by Google.

5.1.1. PageRank algorithm explained

PageRank is a way to measure the “authority” of vertices in a graph. We saw an example of this in chapter 2 where we measured the influence of scientific papers from within a collection of bibliographic citations. Although the original application of PageRank was to assign an authority number to each web page a search engine crawler encounters, PageRank can be used on any directed graph to establish the authority of every node in the graph. Other applications include ranking key people in a social network graph based on people-to-people connections and ranking influencers in a social network based on a graph of “shares”; PageRank also serves as a building block in several advanced machine-learning techniques, such as collaborative filtering and semantic relevance.

A simplistic alternative to PageRank is to measure the in-degrees at each vertex, similar to the way we calculated the out-degrees in section 4.2.2. The GraphX API even provides an inDegrees() function to compute the in-degrees without any additional code. Many people mistakenly believe that such a count of “inbound links” is all PageRank is. PageRank is much more.

PageRank seeks to satisfy a recursive formula: a vertex’s rank is based not on the number of vertices that have edges pointing to the vertex in question, but on the PageRanks of those vertices.

Although the definition of PageRank is recursive, its implementation is a straightforward iterative computation. Figure 5.1, adapted from the 1999 PageRank paper by Page and Brin, “The PageRank citation ranking: Bringing order to the web,” illustrates the algorithm.

Figure 5.1. PageRank iteration. This particular iteration is also the steady, final state of the algorithm, because after the redistribution of PageRank among the vertices, each vertex ends up with the same PageRank value it started the iteration with.

The algorithm can be described as follows:

1.  Initialize vertices with a starting PageRank of 1/N, where N is the number of vertices in the graph.

2.  Loop:

  1. For each vertex, transmit 1/M of its current PageRank along each outbound edge, where M is the out-degree of the vertex.
  2. At each vertex receiving incoming PageRanks from adjacent vertices, sum these up and make that the new PageRank for the vertex.
  3. If PageRanks haven’t significantly changed across the graph since the previous iteration, then exit.
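To make the loop concrete, here is a minimal plain-Scala sketch of the iteration described above (no Spark involved); the three-vertex toy graph and the tolerance value are made up for illustration:

```scala
// Toy directed graph as adjacency lists of outbound edges (hypothetical data)
val outEdges = Map(1 -> Seq(2), 2 -> Seq(1, 3), 3 -> Seq(1))
val n = outEdges.size
val tol = 1e-8

// Step 1: initialize every vertex's PageRank to 1/N
var ranks = outEdges.keys.map(v => v -> 1.0 / n).toMap
var done = false
while (!done) {
  // Step 2a: each vertex sends 1/M of its rank along each of its M outbound edges
  val contribs = outEdges.toSeq.flatMap { case (v, nbrs) =>
    nbrs.map(u => u -> ranks(v) / nbrs.size)
  }
  // Step 2b: each vertex sums the PageRank it receives
  val newRanks = contribs.groupBy(_._1)
    .map { case (v, cs) => v -> cs.map(_._2).sum }
  // Step 2c: exit once no vertex's rank moved by more than tol
  done = ranks.forall { case (v, r) =>
    math.abs(newRanks.getOrElse(v, 0.0) - r) <= tol }
  ranks = newRanks
}
// ranks converges to approximately Map(1 -> 0.4, 2 -> 0.4, 3 -> 0.2)
```

This is the undamped version of the algorithm; section 5.1.2 describes the resetProb parameter that the GraphX implementation adds on top of this loop.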

5.1.2. Invoking PageRank in GraphX

As we saw in chapter 2, GraphX has already implemented PageRank; we didn’t need to code up the algorithm described in the previous subsection. In this section, you’ll see two different ways to invoke PageRank and some of the parameters.

Object-oriented vs. object-based ways to invoke PageRank

GraphX provides two ways to invoke PageRank: object-oriented and object-based. In chapter 2, we invoked the pageRank() function from our Graph object, which is the object-oriented way. The pageRank() function is a function of GraphOps, and Graph automatically creates a GraphOps and provides a conversion of itself to GraphOps whenever needed, much in the same way that an RDD will convert itself to a PairRDD as needed, as we saw in section 4.2.2. The relationships between the various classes are shown in figure 5.2.

Figure 5.2. The companion object Graph contains an implicit (automatic) conversion from Graph to its corresponding GraphOps, so all of the operations available in GraphOps can be invoked as if those operations were declared in the Graph class. GraphOps contains a pageRank() method, but it calls run() from the PageRank singleton object, passing in the graph as the first parameter.

Tip

To find out all the methods you can invoke on an instance of Graph, be sure to look at the GraphOps API documentation in addition to the docs for Graph.

The other way of invoking PageRank—the object-based way—is to call the run() method of the singleton object org.apache.spark.graphx.lib.PageRank, passing in the graph as the first parameter. This is what the pageRank() method of GraphOps does. Which way to invoke PageRank is a matter of style; for example, if you’re already performing a number of operations on your graph, the object-oriented style allows you to chain it as an additional operation.

Fixed number of iterations (“static”) vs. tolerance exit condition (“dynamic”)

For each of the two invocation means (object-oriented and object-based), there’s another choice to be made: whether to exit after a specified number of iterations or to continue iterating (potentially forever) and exit only after a tolerance condition has been met. The GraphX API documentation calls the former static and the latter dynamic.

Whereas the static versions take a parameter numIter (number of iterations), the dynamic versions take a parameter tol (tolerance). If a vertex’s PageRank didn’t change by more than tol between the previous iteration and the current iteration, it will pull itself out of the algorithm, neither distributing its PageRank to its neighbors nor paying attention to PageRank being sent to it by its neighbors. tol is also used to determine when the overall algorithm stops: if no vertex in the entire graph changes by more than tol, the algorithm terminates.
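Assuming g holds a graph such as the citation network from chapter 2, the four combinations look like this in the Spark shell (a sketch, not runnable standalone):

```scala
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.PageRank

// Static (fixed number of iterations): object-oriented, then object-based
val s1 = g.staticPageRank(10)
val s2 = PageRank.run(g, 10)

// Dynamic (tolerance exit condition): object-oriented, then object-based
val d1 = g.pageRank(0.001)
val d2 = PageRank.runUntilConvergence(g, 0.001)
```

Each call returns a new Graph whose vertex attributes are the computed PageRanks, so you can chain further operations onto the result in either style.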

When we used tol in chapter 2, we picked a value of 0.001, which is on the high side for quick algorithm termination. For more precise results, pick a smaller value, such as 0.0001.

The Random Reset Probability

All four variations (object-oriented versus object-based and static versus dynamic) take an additional parameter, resetProb, which the API documentation also refers to as alpha. This resetProb parameter corresponds to what the 1998 Brin and Page paper, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” refers to as a damping factor. You specify resetProb in the range [0,1], and it represents a sort of minimum PageRank value.

Conceptually, resetProb corresponds to the probability that an imaginary web surfer will suddenly visit a random page on the web instead of following one of the outbound links prescribed by the web page that the surfer is currently visiting. This is useful for accounting for sinks—web pages that have inbound links but no outbound links. resetProb ensures that all pages always have some minimum PageRank, and the (1-resetProb) factor in the PageRank formula dampens the contribution of the incoming PageRanks from the adjacent vertices. It is as if imaginary outbound edges were added from all sink vertices to every other vertex in the graph, and, to keep things fair, from the non-sink vertices as well.

Note

The resetProb used in GraphX is the same as 1-d described in the PageRank literature, where d is the damping factor. The PageRank literature recommends a value of 0.85 for the damping factor, and the GraphX documentation recommends a value of 0.15 for resetProb.

The following formula incorporates resetProb to compute the new PageRank for a vertex v:
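Reconstructed from the GraphX implementation, with the sum running over the vertices u that have an edge pointing to v:

```latex
PR(v) = \mathit{resetProb} \;+\; (1 - \mathit{resetProb})
        \sum_{u \to v} \frac{PR(u)}{\mathit{outDegree}(u)}
```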

resetProb is in some sense a hack. For a true, ideal PageRank, resetProb should be set to zero; however, doing so leads to longer convergence times and can also lead to degenerate results: clusters of self-interconnected components getting all the PageRank, and the mainstream highly connected vertices getting zero as their PageRank. An example of what can happen is shown in figure 5.3.

Figure 5.3. The real-world graph of web pages is said to resemble a bow-tie. With a damping factor of 1.0 (resetProb of 0.0), all the PageRanks would get trapped in the “OUT” vertices, and all the “IN” vertices would be left with a PageRank of 0.

Based on the name resetProb, you might think a random number generator is involved. Although in some implementations of PageRank a random number generator is employed, in the GraphX implementation it’s strictly deterministic, and resetProb is treated as a constant. An example of an implementation that uses a random number generator is one that uses the Monte Carlo technique to estimate PageRank in a shorter amount of computation time.

5.1.3. Personalized PageRank

Suppose that instead of ranking web pages you want to recommend people on a social network. Such recommendations should be tailored to the user looking for other people; the user is more likely to be interested in other people who aren’t too far away on the graph.

Personalized PageRank is a variation on PageRank that gives a rank relative to a specified “source” vertex in the graph. Conceptually, the imaginary web surfer (or social network graph wanderer) described in the previous section, when suddenly deciding to visit another vertex, will always land on the specified source vertex. Within GraphX, this concept of an imaginary web surfer is implemented by enforcing a minimum PageRank only on the specified source vertex; the PageRanks of all the other vertices are allowed to fall to zero (for example, if they have no inbound links).

In chapter 2 you saw PageRank run on a network of paper citations. In that example, the paper “Noncompact Symmetries in String Theory” had the highest PageRank. In the following listing, we specify the source vertex to be 9207016, the ID of that paper.

Listing 5.1. Personalized PageRank to find the most important related paper
import org.apache.spark.graphx._
val g = GraphLoader.edgeListFile(sc, "cit-HepTh.txt")
g.personalizedPageRank(9207016, 0.001)
 .vertices
 .filter(_._1 != 9207016)
 .reduce((a,b) => if (a._2 > b._2) a else b)
res1: (org.apache.spark.graphx.VertexId, Double) =
       (9201015,0.09211875000000003)

When we look up ID 9201015 in the abstracts from the SNAP site, we see it’s “An Algorithm to Generate Classical Solutions for String Effective Action.” According to the Personalized PageRank algorithm, this is the most important paper from the perspective of the paper “Noncompact Symmetries in String Theory.”

The GraphX implementation of Personalized PageRank is limited in a couple of ways compared to implementations on other systems. First, only one source vertex can be specified. If a group of source vertices were allowed, it would be possible, for example, to find the most important person to a group of people, such as 1992 Harvard alumni. Second, the weight of the source vertex can’t be specified; in the GraphX implementation it’s hard-coded to 1.0, meaning the minimum PageRank for a vertex is one of two extremes: 0 for vertices other than the source vertex, or 1.0 * resetProb for the source vertex. This isn’t a big limitation while GraphX allows only a single source vertex, but if GraphX gains the ability to specify multiple source vertices in the future, specifying a weight for each one would let you express how much affinity or importance each source vertex carries relative to the others.

5.2. Measuring connectedness: Triangle Count

Where PageRank measured the influence of individual vertices, counting triangles can measure the connectedness of a graph or subgraph—how, collectively, the vertices together influence each other. For example, in a social network, if everyone influences everyone else—if everyone is connected to everyone else—there will be a lot of triangles.

A triangle is what it sounds like: three vertices that are all connected with edges. But there can be some subtleties when dealing with directed graphs, as GraphX does. When counting triangles, GraphX treats the graph as if it were undirected: it ignores edge directions (see figure 5.4), collapses duplicate edges into one, and eliminates loop edges from a vertex back to itself.

Figure 5.4. When counting triangles, GraphX doesn’t care about edge direction. These are both triangles, even though the triangle on the left forms a cycle and the triangle on the right has a dead end.
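To make the cleanup rules concrete, here’s a minimal plain-Scala triangle counter over a toy edge list (the data is made up for illustration); it drops loops, ignores direction, and collapses duplicates before counting:

```scala
// Toy edge list containing a duplicate edge (1,2)/(2,1), a loop (3,3),
// and mixed directions (hypothetical data)
val raw = Seq((1, 2), (2, 1), (2, 3), (3, 1), (3, 3), (3, 4))

// Cleanup: drop loops, ignore direction, collapse duplicate edges
val edges = raw.filter { case (a, b) => a != b }
  .map { case (a, b) => (math.min(a, b), math.max(a, b)) }
  .toSet

// Undirected neighbor sets
val nbrs = edges.toSeq
  .flatMap { case (a, b) => Seq(a -> b, b -> a) }
  .groupBy(_._1)
  .map { case (v, vs) => v -> vs.map(_._2).toSet }

// A triangle is an edge (a,b) plus a vertex adjacent to both a and b;
// each triangle has three edges, so the per-edge tally triple-counts it
val triangles = edges.toSeq.map { case (a, b) => (nbrs(a) & nbrs(b)).size }.sum / 3
// Only 1-2-3 forms a triangle here, so triangles == 1
```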

5.2.1. Uses of Triangle Count

The more triangles a graph or subgraph has, the more connected it is. This property can be used to identify cliques (parts of the graph that have many interconnections), provide recommendations, and identify spammers. More connectedness isn’t always better, though. A fully connected graph, where every vertex connects to every other vertex, conveys no information about connectedness (though edge and vertex properties could still carry information). Sometimes a lack of connectedness identifies valuable vertices within the graph. For example, a research paper (“Visualizing the Signatures of Social Roles in Online Discussion Groups” by Welser et al.) shows that those who answer questions on online forums are often loners with weak connections to a large number of other forum participants. Such a helpful loner may have edges to a bunch of unrelated question-askers, but because those askers are rarely connected to each other, few triangles are formed. When trying to identify valuable question-answerers, a tell-tale sign might be a low Triangle Count.

Triangle Count serves as one factor in two other metrics known as the clustering coefficient and the transitivity ratio. These are more complicated to compute because they involve more than counting triangles. But being ratios, they’re scaled/normalized with a denominator, making it easy to compare the connectedness of graphs of different sizes. As of version 1.6, GraphX doesn’t have built-in algorithms to compute the clustering coefficient or the transitivity ratio. Chapter 8 shows how to compute the global clustering coefficient.

5.2.2. Slashdot friends and foes example

Here we’ll show an example of using Triangle Count to measure connectedness among various arbitrary subsets of users of Slashdot.org, the popular technology news and discussion site started in 1997. SNAP, the same Stanford repository of graph data we used in chapter 2, has an anonymized edge list of Slashdot “friends” and “foes.” On Slashdot, a user reading comments can tag authors of forum comments as friends or foes to be reminded of that opinion the next time the user encounters a comment written by the same author. Even though the SNAP Slashdot data is anonymized—the vertex IDs don’t match real-life Slashdot user IDs—the vertex IDs still appear to be in increasing order; longer-term users have lower vertex IDs.

In this data, the vertex IDs start at 0 and go up to over 70,000. We’ll break this up into seven sets of 10,000 vertices each. That means we’ll lose a lot of edges, namely the ones from one subgraph to another. We’ll count the number of triangles within each of the seven subgraphs and see if there is a trend over time. Because long-term users tend to have interacted with each other often, there should be a high degree of interconnectedness among them. We would expect the first subgraph of 10,000 to be a tight-knit, highly connected group, and the second subgraph of 10,000 to have its connections divided between other members of the second subgraph and the well-respected users in the first subgraph. But remember, we discard the edges from the second subgraph to the first when we cut out each subgraph. Continuing on, we would expect the third subgraph to have even fewer triangles.

Subgraphs in GraphX

Taking a subgraph in GraphX is straightforward. The subgraph() method of Graph takes two parameters: an edge predicate function and a vertex predicate function. Both are optional; you can specify either one or both. The edge predicate function is presented with every edge in the graph and must return true or false—true if the edge is to be part of the subgraph. The vertex predicate function works similarly. If the edge predicate function filters out all edges to and from a vertex, that vertex remains in the subgraph as a naked vertex with no edges. If the vertex predicate function filters out one or both of the vertices of an edge, that edge doesn’t make it into the subgraph.
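For instance, assuming g is an already-loaded graph, either predicate can be supplied by name (Spark shell code, not runnable standalone):

```scala
// Keep only the first 10,000 vertices; edges touching dropped vertices vanish too
val sub1 = g.subgraph(vpred = (vid, attr) => vid < 10000L)

// Or filter on edges instead, e.g. keep edges whose source ID is even;
// every vertex survives, some as naked vertices with no edges
val sub2 = g.subgraph(epred = triplet => triplet.srcId % 2 == 0)
```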

Note

In Spark 1.6 and earlier, triangleCount() imposes a couple of severe prerequisites on the graph. First, the graph has to be partitioned by one of the PartitionStrategy options described in section 9.4. Second, if there are any duplicate edges (two or more edges between the same two vertices), those duplicate edges have to point in the same direction. The GraphX documentation overstates this latter requirement; it says that all edges must be in canonical order, pointing from the lower-numbered vertex ID to the higher-numbered vertex ID. Putting edges in canonical order is usually the easiest way to meet the second requirement, but if, for example, your graph has no duplicate edges, there’s nothing to worry about (except for the partitioning from the first requirement). Jira ticket SPARK-3650, not targeted to any specific Spark release (as of Spark 1.6), would lift these requirements.

To get started, download the Slashdot friend and foe edge data from http://snap.stanford.edu/data/soc-Slashdot0811.html and uncompress it. Then from the Spark Shell, do what’s shown in the following listing.

Listing 5.2. Triangle Counts on Slashdot friend and foe data
val g = GraphLoader.edgeListFile(sc, "soc-Slashdot0811.txt").cache
val g2 = Graph(g.vertices, g.edges.map(e =>
        if (e.srcId < e.dstId) e else new Edge(e.dstId, e.srcId, e.attr))).
    partitionBy(PartitionStrategy.RandomVertexCut)
(0 to 6).map(i => g2.subgraph(vpred =
        (vid,_) => vid >= i*10000 && vid < (i+1)*10000).
    triangleCount.vertices.map(_._2).reduce(_ + _))
res1: scala.collection.immutable.IndexedSeq[Int] = Vector(1352001, 61376,
10865, 3935, 1384, 786, 658)

Scala Tip

Scala allows two different syntaxes for invoking functions. One is the familiar Java style, with a preceding period and the parameters inside parentheses. In the other, the period and the parentheses are omitted. Scala allows this if the function has zero parameters (where it’s called postfix notation) or one parameter (where it’s called infix notation). Stylistically, Scala programmers typically leave off parentheses whenever possible, especially when a function has no side effects (leaves the underlying object unchanged). In the preceding listing, to is a function of scala.Int (of which the zero (0) preceding it is an instance), and the subsequent 6 is its parameter. to is not a Scala keyword but merely part of the Scala standard library, and documentation on to can be found in the API documentation page for scala.Int.
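The two syntaxes are interchangeable, as this plain-Scala snippet shows:

```scala
val a = 0 to 6      // infix notation: to is a one-parameter method on scala.Int
val b = 0.to(6)     // the same call in the familiar Java style
// Both produce the range 0, 1, 2, 3, 4, 5, 6
```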

The portion of the computation of g2 that ensures edge vertex IDs are in ascending order will be unnecessary once SPARK-3650 is fixed.

Our hypothesis was confirmed: each succeeding subgraph of 10,000 vertices had a lower triangle count.

In this simple example, we didn’t care about Triangle Counts on a per-vertex basis, but such information is useful when considering local connectedness as opposed to global graph connectedness.

As with PageRank, there’s an object-based TriangleCount version as well in the graphx.lib package.

5.3. Find the fewest hops: ShortestPaths

The ShortestPaths algorithm built into GraphX measures distance purely as the number of hops: it ignores any distance values that may be attached to the edges, and it returns only the hop count, not the full path describing how to get from one vertex to another.

You might be tempted, based on the name, to use this to plot driving routes on a map, but an algorithm that uses distance values on edges will be covered in section 6.2.

An example where you would want to count hops is counting the shortest number of “friends” edges from each vertex in a social network graph to get to “Fred Marple.” Or, because the GraphX API supports passing in a list of vertices known as landmarks, the shortest distance from each vertex in the graph to any of those could be computed—for example, anyone in the class of ‘79. An example from another domain is counting network hops in a computer network to the nearest tier-one node.
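Under the hood, hop counting amounts to a breadth-first search. Here is a minimal plain-Scala sketch over a toy undirected graph (the data is made up for illustration):

```scala
// Adjacency lists for a toy undirected chain 1 - 2 - 3 - 4
val adj = Map(1 -> Seq(2), 2 -> Seq(1, 3), 3 -> Seq(2, 4), 4 -> Seq(3))

// Breadth-first search: the number of hops from `from` to `landmark`,
// or None if the landmark is unreachable
def hops(from: Int, landmark: Int): Option[Int] = {
  var frontier = Set(from)
  var seen = Set(from)
  var d = 0
  while (frontier.nonEmpty) {
    if (frontier.contains(landmark)) return Some(d)
    frontier = frontier.flatMap(adj).diff(seen)  // expand one hop outward
    seen ++= frontier
    d += 1
  }
  None
}
// hops(1, 4) == Some(3); hops(1, 1) == Some(0)
```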

For a simple example, using the example graph from figure 1.5 and constructed in the Spark Shell in listing 4.1, the code in listing 5.3 computes the shortest number of hops from each vertex in the graph to Charles. Note that even though the algorithm is formally named “shortest paths,” from an API perspective, the GraphX implementation only returns the shortest distances. The result is shown in figure 5.5.

Figure 5.5. ShortestPaths finds the number of hops (shown in the squares) from every vertex to a particular specified vertex (in this case, vertex #3). It does not take into account any edge weights (which might, for example, represent distances on a map), and it only returns the number of hops, not any routes on how to achieve that shortest distance.

Listing 5.3. Invoking ShortestPaths
lib.ShortestPaths.run(myGraph,Array(3)).vertices.collect
res2: Array[(org.apache.spark.graphx.VertexId,
     org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((4,Map()),
     (1,Map(3 -> 2)), (3,Map(3 -> 0)), (5,Map()), (2,Map(3 -> 1)))

Note that ShortestPaths can be invoked only in the object-based style, as there is no corresponding method in Graph or GraphOps.

5.4. Finding isolated populations: Connected Components

Connected Components can find isolated communities in social network graphs and detect “partitioning” in a data center network. The Connected Components algorithm is relevant for both directed and undirected graphs. The code to construct the graph in figure 5.6 and find its Connected Components is shown in the following listing.

Figure 5.6. This one graph of 7 vertices and 5 edges has three Connected Components. If this graph data were from a social network, each Connected Component would be considered a separate community.

Listing 5.4. Invoking connectedComponents()
val g = Graph(sc.makeRDD((1L to 7L).map((_,""))),
    sc.makeRDD(Array(Edge(2L,5L,""), Edge(5L,3L,""), Edge(3L,2L,""),
                     Edge(4L,5L,""), Edge(6L,7L,"")))).cache
g.connectedComponents.vertices.map(_.swap).groupByKey.map(_._2).collect
res3: Array[Iterable[org.apache.spark.graphx.VertexId]] = Array(
CompactBuffer(1), CompactBuffer(6, 7), CompactBuffer(4, 3, 5, 2))

connectedComponents returns a new Graph object with the same structure as the input graph. Connected Components are identified by the lowest vertex ID in the component, and this value is assigned as an attribute to each vertex. We have the result shown in table 5.1.

Table 5.1. Connected Components are identified by the lowest vertex ID.

Component ID    Component members
1               1
2               2, 3, 4, 5
6               6, 7

As with PageRank and some of the other built-in algorithms, there are both object-oriented and object-based ways to invoke GraphX’s Connected Components implementation.
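Conceptually, the labeling works by having each vertex repeatedly adopt the smallest vertex ID it has seen among itself and its neighbors until nothing changes. Here is a plain-Scala sketch of that idea on the same toy graph as listing 5.4:

```scala
val vertices = (1L to 7L).toSet
val edgePairs = Set((2L, 5L), (5L, 3L), (3L, 2L), (4L, 5L), (6L, 7L))
// Treat the edges as undirected
val und = edgePairs.flatMap { case (a, b) => Seq((a, b), (b, a)) }

var label = vertices.map(v => v -> v).toMap  // start with each vertex's own ID
var changed = true
while (changed) {
  val next = vertices.map { v =>
    // smallest label among this vertex and its neighbors
    val nbrLabels = und.collect { case (a, b) if a == v => label(b) }
    v -> (nbrLabels + label(v)).min
  }.toMap
  changed = next != label
  label = next
}
// label ends up as in table 5.1: vertex 1 -> 1; 2,3,4,5 -> 2; 6,7 -> 6
```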

5.4.1. Predicting social circles

Now that you’ve seen how easy it is to generate Connected Components in GraphX, let’s put the algorithm to work on a real-world dataset. Not only will we get to run the Connected Components algorithm we’ve looked at, we’ll also see how to import data, manipulate it into the structure that we need, and output the results in a particular format. We’ll use a dataset derived from Facebook that was used in a 2014 Kaggle data science competition.

Kaggle (www.kaggle.com) hosts competitions in which a dataset is provided for download and the participants are set a task to predict a certain outcome for each record in the dataset. Over the course of the competition, competitors get to submit their predictions, which are scored for their accuracy compared to some ground truth known only to the competition organizers. At the end of the competition, the competitor with the best score wins, with prizes ranging from cash to job offers.

We’re going to use data from the Learning Social Circles from Networks competition. The data was collected from a small number of Facebook users who had supplied information on friends in their network. In addition to the graph of their network, each contributing user was asked to allocate their friends to one or more social circles. A social circle is some grouping of the user’s friends that made sense to that user. For example, it could be colleagues at work, people from the same school, or a group of friends they socialize with. What constitutes a social circle was left up to the user. Circles could overlap, be completely contained by one or more other circles, and could even be empty.

The aim of the competition was to use the network information to get the best prediction of how users would be grouped into circles. Clearly there are numerous ways that we could tackle the problem, but we’ll use a simple approach of finding connected components within the graph of each user’s connections. Our prediction of what circles each user has and how friends are allocated to those circles are then the Connected Components.

Getting the Kaggle data on social networks

To download the data from Kaggle, you’ll need to set up an account with Kaggle. Once you’ve done that, navigate to www.kaggle.com/c/learning-social-circles/data. This page lists a number of files, but you want the one called egonets.zip. You’ll be asked to accept the competition rules even though the competition has ended, so go ahead and accept. Unzip the file and have a look at the directory—we saved the egonets folder underneath a folder structure called socialcircles/data, as shown in figure 5.7.

Figure 5.7. The egonets folder contains 111 egonet files, one for each user.

Note

The term egonet comes from a paper by Julian McAuley and Jure Leskovec from Stanford in which they describe individual users as egos and users’ connections as alters. All very Freudian!

The dataset is anonymized so that each contributing user is given an ID, and you’ll find an egonet file for each user. Open one up and look at its contents. We’ve chosen 3077.egonet because it is small enough to display on the page; here are its contents:

3078: 3085 3089
3079: 3082
3080: 3089
3081: 3085 3083 3089
3082: 3079 3086 3089
3083: 3085 3081 3089
3084:
3085: 3083 3078 3081 3088 3089
3086: 3082
3087:
3088: 3085 3089
3089: 3085 3080 3083 3078 3082 3081 3088

The egonet file lists each of the user’s friends and, for each of those friends, who their connections are. There is one row for each of the user’s friends (again anonymized and given numeric IDs). The format is

Friend-id: space-separated list of connection-ids

User 3077 has 12 friends (IDs 3078 to 3089). Friend 3078 is connected to two of 3077’s other friends, 3085 and 3089. Friends 3084 and 3087 seem to be loners in the group who aren’t connected to any of 3077’s other friends. Figure 5.8 shows a graph of these connections. In this instance we have one big connected component containing everyone except 3084 and 3087, plus singleton vertices for the two loners. We should find three connected components, which will form the basis for three social circles. Other egonets have a more complicated structure.

Figure 5.8. The egonet for user 3077 forms three connected components: two single-vertex components (vertices 3084 and 3087) and one large component (all the other vertices).

Reading folder contents with wholeTextFiles

Our task is to read in each of the egonet files, create a graph from the friends and their connections, find the connected components, and output the resulting social circles.

As with many real-world problems, we have to accept the input in the format it is given to us and convert that input into the structures we need to create graphs. In this case, the input data is not in a single convenient file but rather is scattered across the files of a directory. Luckily we can use a method on SparkContext called wholeTextFiles to read the contents of a folder into an RDD:

val egonets = sc.wholeTextFiles("socialcircles/data/egonets")

wholeTextFiles returns a PairRDD with one element for each file; the key is the folder path to the file, and the value is the contents of the file (see figure 5.9).

Figure 5.9. A folder path is passed to wholeTextFiles to generate a PairRDD with the file path as the key for each element and the file contents as the value. This performs well and is a convenient way to read a directory of text files, leaving all the data in a single RDD.

Finding social circles

Listing 5.5 shows the full code to read in the input, generate the social circles, and output the predictions to the console. Once wholeTextFiles has loaded the egonets directory into an RDD called egonets, we generate two arrays, egonets_numbers and egonets_edges, using a map operation on each element of egonets.

The first map operation uses a method we define called extract. This uses a regular expression to extract the user ID from the filename.

Scala Tip

Regular expressions (or regexes) can be created by appending .r to the end of a string. Regexes are usually a pain to write in Java because every backslash has to be escaped by another backslash. For this reason, Scala provides “raw” strings using the triple-quoting syntax used in the example. Finally, Scala regexes provide “extractor” methods that allow you to designate variables that a matching regex will populate. The code val Pattern(num) = s populates the num variable with the group matched in the string s.
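A toy version of the extraction looks like this (the exact pattern is an assumption):

```scala
// Raw triple-quoted string: backslashes don't need to be doubled
val Pattern = """^.*?(\d+).egonet$""".r
// The extractor binds the capture group when the whole string matches
val Pattern(num) = "socialcircles/data/egonets/3077.egonet"
// num == "3077"
```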

The second map expression calls a helper method, make_edges, which parses each egonet file’s contents to create edges between each friend connection. Another helper method, get_circles, creates a graph of the egonet using Graph.fromEdgeTuples. Once we have the graph, we call connectedComponents to derive the social circles (see the following listing).

Listing 5.5. Find and list social circles
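A sketch of how the pieces fit together, using the helper names described in the text (extract, make_edges, get_circles); the exact regex, the self-edge trick for friendless vertices, and the output formatting are assumptions (Spark shell code, not runnable standalone):

```scala
import org.apache.spark.graphx._

// Pull the user ID out of a path such as ".../egonets/3077.egonet"
def extract(s: String): Long = {
  val Pattern = """^.*?(\d+).egonet$""".r
  val Pattern(num) = s
  num.toLong
}

// Parse one egonet file into (friend, connection) edge tuples;
// a friend with no connections gets a self-edge so the vertex isn't lost
def make_edges(contents: String): List[(Long, Long)] =
  contents.split("\n").toList.flatMap { line =>
    val parts = line.split(":")
    val src = parts(0).trim.toLong
    val dsts =
      if (parts.length > 1) parts(1).trim.split(" ").filter(_.nonEmpty).toList
      else Nil
    if (dsts.isEmpty) List((src, src))
    else dsts.map(d => (src, d.toLong))
  }

// Build a graph from the edges and render its connected components
// in the competition's format: circles separated by semicolons
def get_circles(edges: List[(Long, Long)]): String = {
  val g = Graph.fromEdgeTuples(sc.makeRDD(edges), 1)
  g.connectedComponents.vertices
    .map(_.swap).groupByKey
    .map(_._2.mkString(" ")).collect.mkString(";")
}

val egonets = sc.wholeTextFiles("socialcircles/data/egonets")
val egonets_numbers = egonets.map { case (path, _) => extract(path) }.collect
val egonets_edges = egonets.map { case (_, contents) => make_edges(contents) }.collect
val results = egonets_numbers.zip(egonets_edges.map(get_circles))
results.foreach { case (user, circles) => println(user + "," + circles) }
```

Note that egonets_numbers and egonets_edges are collected to local arrays before get_circles is called; RDD operations can’t be nested inside other RDD operations.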

Scala Tip

A for comprehension is like a map() combined with a filter(). The syntax starts off with something like x <- myCollection, which is similar to the enhanced for-loops introduced in Java 5. Immediately following is an optional guard, which acts like the filter(). Finally comes the yield {}, which acts like the function passed into map(); the difference is that the variable that would be the parameter of that function is instead declared at the beginning of the for comprehension.

The competition requires a particular format for predictions. Each social circle is represented as a space-separated list of user IDs, and the circles are separated from one another by a semicolon (;). The code we use to do this is as follows:
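With hypothetical circles in hand, the formatting boils down to two mkString calls:

```scala
// Hypothetical circles for one user: each circle is a list of friend IDs
val circles = Seq(Seq(3078L, 3079L, 3080L), Seq(3084L), Seq(3087L))
val formatted = circles.map(_.mkString(" ")).mkString(";")
// "3078 3079 3080;3084;3087"
```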

In this example, we have output the user ID and the social circles in the format required for the competition; but the choice of output depends on what subsequent use we want to make of the information. Spark allows you to easily push the data to an external database, output to a real-time system, or even integrate into a more extensive machine-learning pipeline.

5.5. Reciprocated love only, please: Strongly Connected Components

For directed graphs, sometimes we might want to eliminate dead ends from our components. In social networks, Strongly Connected Components can form the basis of a recommendation engine once other signals are added. Another application is verifying that a state machine has no dead-end states where it could get stuck. Strongly Connected Components are also useful in optimizing compilers, where data flow analysis identifies expressions that are never used and would otherwise be wastefully computed.

Figure 5.10. In Strongly Connected Components, every vertex is reachable from every other vertex in the component. Within a Strongly Connected Component, no vertex can act as a dead end.

Invoking stronglyConnectedComponents() is similar to invoking connectedComponents() except that a parameter numIter is required. Assuming g is defined as in the previous section, the following listing finds its Strongly Connected Components.

Listing 5.6. Invoking stronglyConnectedComponents()
g.stronglyConnectedComponents(10).vertices.map(_.swap).groupByKey.
    map(_._2).collect
res4: Array[Iterable[org.apache.spark.graphx.VertexId]] = Array(
CompactBuffer(4), CompactBuffer(1), CompactBuffer(6), CompactBuffer(7),
CompactBuffer(3, 5, 2))

5.6. Community detection: LabelPropagation

To identify close-knit communities within a graph, GraphX provides the label propagation algorithm (LPA) as described by Raghavan et al. in their 2007 paper “Near linear time algorithm to detect community structures in large-scale networks.” The idea is to have densely connected groups of vertices form a consensus on a unique label and thereby define communities.

Definition

Many iterative algorithms are guaranteed to get closer to a particular result on each iteration of the algorithm; they converge. With algorithms that have this property, it’s reasonable to run the algorithm for as many iterations as required and use a tolerance test to exit the algorithm when they’re “close enough.” Algorithms that don’t converge could continue forever without converging, so we need to specify an upper limit on the number of iterations that will be run. Inevitably in this situation there is a trade-off between the accuracy of the end result and the time the algorithm takes to run.

Unfortunately, LPA often doesn’t converge. Figure 5.11 shows an example of non-convergence—the graph in step 5 is the same as in step 3, and the algorithm continues forever ping-ponging between the two graphs that look like steps 4 and 5. For that reason, GraphX only provides a static version that runs for a number of iterations you specify and doesn’t provide a dynamic version with a tolerance-terminating condition.

Figure 5.11. The LPA algorithm often doesn’t converge. Step 5 is the same as step 3, meaning that steps 3 and 4 keep repeating forever.

Despite its name, LPA is also not applicable to the use case of propagating classifications of vertices—propagating labels from vertices of known classification to vertices of unknown classification. Section 7.3 explores this use case, called semi-supervised learning.

LPA, in contrast, uses the vertex IDs as its initial labels, as shown in step 0 of figure 5.11 (see the following listing). LPA doesn’t care about edge direction, effectively treating the graph as undirected. The flip-flopping between two sets of labels shown in steps 3 through 5 of figure 5.11 is illustrated in a similar example in the original Raghavan paper.

Listing 5.7. Invoking LabelPropagation
val v = sc.makeRDD(Array((1L,""), (2L,""), (3L,""), (4L,""), (5L,""),
 (6L,""), (7L,""), (8L,"")))
val e = sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""), Edge(3L,4L,""),
 Edge(4L,1L,""), Edge(1L,3L,""), Edge(2L,4L,""), Edge(4L,5L,""),
 Edge(5L,6L,""), Edge(6L,7L,""), Edge(7L,8L,""), Edge(8L,5L,""),
 Edge(5L,7L,""), Edge(6L,8L,"")))
lib.LabelPropagation.run(Graph(v,e),5).vertices.collect.
 sortWith(_._1<_._1)
res5: Array[(org.apache.spark.graphx.VertexId,
     org.apache.spark.graphx.VertexId)] = Array((1,2), (2,1), (3,1), (4,2),
     (5,4), (6,5), (7,5), (8,4))
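To read the output as communities, we can group vertices by label, much as we did with connectedComponents. This sketch works locally on the (vertexId, label) pairs from the listing’s result; on a real RDD the same idea would use map(_.swap) and groupByKey:

```scala
// The (vertexId, label) pairs returned by LabelPropagation in listing 5.7
val labels = Seq((1L,2L), (2L,1L), (3L,1L), (4L,2L),
                 (5L,4L), (6L,5L), (7L,5L), (8L,4L))

// Group vertex IDs that ended up with the same label
val communities = labels.groupBy(_._2)
                        .values
                        .map(_.map(_._1).sorted)
                        .toList

communities.foreach(println)
```

Note that after five iterations the labels don’t cleanly separate the two four-vertex cliques in the example graph; this is the non-convergence behavior discussed above.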

5.7. Summary

  • GraphX’s built-in algorithms range widely in their usefulness, power, and applicability.
  • PageRank is useful for a number of different applications beyond ranking web pages for a search engine.
  • Personalized PageRank is useful for ranking “people you may know” in a social network.
  • Triangle Count can serve as a gross measure for connectedness, but another measure to be introduced in chapter 8, the Global Clustering Coefficient, has the advantage of always being within the range of 0 to 1, facilitating comparisons between graphs of different sizes.
  • Connected Components and Strongly Connected Components can find social circles in social networks.
  • GraphX’s Label Propagation is less useful because it rarely converges.