Modifying graph structures

The GraphX library also comes with four useful methods for changing the structure of graphs. Their method signatures are listed as follows:

class Graph[VD, ED] {
  def reverse: Graph[VD, ED]
  
  def subgraph(epred: EdgeTriplet[VD,ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
               
  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  
  def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
}

The reverse operator

As its name suggests, the reverse operator returns a new graph with all the edge directions reversed. It does not modify any vertex or edge properties, or change the number of edges. Moreover, its implementation does not produce any data movement or duplication.

The subgraph operator

Next on the list is subgraph, which is useful for filtering graphs. It takes two predicate functions as arguments that return Boolean values. The first predicate epred takes an EdgeTriplet and returns true when the triplet satisfies the predicate. Similarly, the vpred predicate takes a pair of (VertexId, VD) and returns true when the vertex satisfies the predicate condition.

Using these predicates, subgraph returns the graph containing only the nodes that satisfy the vertex predicate and keeps only the edges satisfying the edge predicate between the remaining nodes. By default, the vertex or edge predicate functions are set to return true when they are not provided. That means that we can pass only an edge predicate, only a vertex predicate, or both. If only a vertex predicate is passed to subgraph and two connected vertices are filtered out, then the edge connecting these nodes will automatically be filtered out as well.

The subgraph operator is very handy for countless situations. For instance, it is often the case, in practice, that the graphs have isolated nodes or edges with missing vertex information. We can eliminate these graph elements using subgraph. Another situation where subgraph is useful is when we want to remove hubs in the graph, for example, nodes that are connected to too many nodes.

As a concrete example, let's use subgraph to answer the following question often encountered in social networks: "Which people in my friends' list of friends are not yet my friends?":

// Given a social network 
type Name = String
class Person(name: Name, friends: List[Name])
val socialNetwork: Graph[Person, Int] = ... 

// that I am part of
val me = Person(myName, myFriends)

// I want know my friends' friends that are not yet my friends
val potentialFriends = socialNetwork.subgraph(vpred = 
(_, p: Person) => !(me.friends contains p.name)) 

Note

Note that we did not pass an edge predicate as an argument to subgraph. Thus, Scala uses the default value for epred, which is a function that always returns true. On the other hand, we should pass vpred as a named parameter so that Scala knows which predicate is passed or is missing.

The mask operator

The mask operator also filters a graph on which it is invoked. In contrast to subgraph, mask does not take predicate functions as arguments. Instead, it takes another graph. Then, the expression graph.mask(anotherGraph) constructs a subgraph of graph by returning a graph that contains the vertices and edges that are also found in anotherGraph. This can be used together with the subgraph operator to filter a graph based on the properties in another related graph.

Consider the following situation where we want to find the connected components of a graph but we want to remove vertices with missing attribute information in the resulting graph. We can then run the connectedComponent algorithm we previously saw and use subgraph and mask together to obtain the desired result. This is shown in the following code:

// Run Connected Components
val ccGraph = graph.connectedComponents() 

// Remove vertices with missing attribute values and the edges connected to them
val validGraph = graph.subgraph(vpred = 
(_, attr) => attr.info != "NA")

// Restrict the resulting components to the valid subgraph
val validCCGraph = ccGraph.mask(validGraph) 

The groupEdges operator

Spark's property graphs are allowed to pair any of the connected nodes to have multiple edges. The groupEdges operator is another structural operator that merges duplicate edges between each pair of nodes into a single edge. To do that, groupEdges requires one function argument named merge, which takes a pair of edge attributes of type ED and combines them into a single attribute value of the same type. As a result, the graph returned by groupEdges has the same type as the original one. Later in this chapter, we will work on a detailed example in which we will see groupEdges in action.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset