In this chapter, we will learn to transform graphs using different sets of operators. In particular, we will cover graph-specific operators that either change the properties of graph elements or modify the structure of graphs. In other words, all the operators that we use here are methods that are invoked on a graph and return a new graph. In addition, we will use join methods to combine graph data with other datasets. Using real-world datasets, you will understand when and how to:
The map operator is a core method for transforming distributed datasets or RDDs in Spark. Similarly, property graphs also have three map operators defined as follows:
class Graph[VD, ED] {
  def mapVertices[VD2](mapFun: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](mapFun: Edge[ED] => ED2): Graph[VD, ED2]
  def mapTriplets[ED2](mapFun: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
}
Each of these methods is called on a property graph with vertex attribute type VD and edge attribute type ED. Each of them also takes a user-defined mapping function mapFun that does one of the following:
- For mapVertices, mapFun takes a pair of (VertexId, VD) as input and returns a transformed vertex attribute of type VD2.
- For mapEdges, mapFun takes an Edge object as input and returns a transformed edge attribute of type ED2.
- For mapTriplets, mapFun takes an EdgeTriplet object as input and returns a transformed edge attribute of type ED2.
In each case, the graph structure remains intact: these map operators never change the links between the vertices or the vertex indices. This is one key advantage of these operators over the basic RDD map operator. Although an RDD map can be used to achieve the same result, the graph-specific operators are more efficient because GraphX can reuse the structural indices of the original graph in the new one. Therefore, prefer these three mapping operators whenever you just want to transform a graph's attributes without modifying its structure.
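The comparison with the plain RDD map can be sketched as follows. This is a minimal illustration rather than a benchmark: the toy graph, the value names, and the local SparkContext setup are assumptions made for the example, and it is meant to be pasted into spark-shell or run as a Scala script with Spark and GraphX on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Reuses the running context in spark-shell; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("map-vs-mapVertices"))

// Hypothetical toy graph: vertex attribute = name, edge attribute = weight.
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "ann"), (2L, "bob"), (3L, "cy")))
val edges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 5), Edge(2L, 3L, 7)))
val graph: Graph[String, Int] = Graph(vertices, edges)

// Option 1: plain RDD map, then rebuild the graph from scratch.
// The attributes are correct, but the graph structure is constructed again.
val rebuiltVertices: RDD[(VertexId, Int)] =
  graph.vertices.map { case (id, name) => (id, name.length) }
val viaRddMap: Graph[Int, Int] = Graph(rebuiltVertices, graph.edges)

// Option 2: mapVertices, which keeps the structure of the original graph.
val viaMapVertices: Graph[Int, Int] =
  graph.mapVertices((_, name) => name.length)

// Both approaches yield the same vertex attributes.
val fromRddMap = viaRddMap.vertices.collect().sortBy(_._1).toList
val fromMapVertices = viaMapVertices.vertices.collect().sortBy(_._1).toList
```

Both lists contain the same (VertexId, Int) pairs; the difference lies only in how much work GraphX has to redo to produce them.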
The difference between mapEdges and mapTriplets is that, for the latter, the edge attribute and the attributes of both the source and destination vertices are available in the triplet passed to mapFun to create a new edge attribute. In contrast, the mapFun in mapEdges has access to the edge attribute but not to the vertex attributes.
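To make the distinction concrete, here is a small sketch on a two-vertex graph. The graph, its values, and the names ages, durations, doubled, and ageGap are hypothetical and chosen only for illustration; the SparkContext setup assumes a local run or spark-shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Reuses the running context in spark-shell; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("mapEdges-vs-mapTriplets"))

// Hypothetical graph: vertex attribute = age, edge attribute = duration.
val ages: RDD[(VertexId, Int)] = sc.parallelize(Seq((1L, 30), (2L, 40)))
val durations: RDD[Edge[Float]] = sc.parallelize(Seq(Edge(1L, 2L, 10.0f)))
val g: Graph[Int, Float] = Graph(ages, durations)

// mapEdges: the function sees an Edge, that is, e.attr together with the
// endpoint IDs e.srcId and e.dstId, but no vertex attributes.
val doubled: Graph[Int, Float] = g.mapEdges(e => e.attr * 2)

// mapTriplets: the function sees an EdgeTriplet, so t.srcAttr and t.dstAttr
// can be used in addition to t.attr.
val ageGap: Graph[Int, Int] =
  g.mapTriplets(t => math.abs(t.srcAttr - t.dstAttr))

val doubledAttrs = doubled.edges.map(_.attr).collect().toList
val ageGapAttrs = ageGap.edges.map(_.attr).collect().toList
```

If the new edge attribute depends only on the old one, mapEdges is enough; reach for mapTriplets when the vertex attributes at either end are needed.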
Now, let's see them in action through some simple examples.
Consider a social graph between people, where the vertex attribute has type Person and the edge attribute has type Link. First, let's create these Scala types as follows:
case class Person(first: String, last: String, age: Int)
case class Link(relationship: String, duration: Float)
Suppose we build the graph from a VertexRDD called people and an EdgeRDD named links:
val inputGraph: Graph[Person, Link] = Graph(people, links)
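The text does not show how people and links are created, so the following is only one possible way to obtain them for experimentation. The names, ages, and relationships are made-up sample data, and plain RDDs are used here on the assumption that any RDD of (VertexId, Person) pairs and any RDD of Edge[Link] objects is acceptable input to Graph(...).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

case class Person(first: String, last: String, age: Int)
case class Link(relationship: String, duration: Float)

// Reuses the running context in spark-shell; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("social-graph-sketch"))

// Made-up sample vertices: (vertex ID, Person).
val people: RDD[(VertexId, Person)] = sc.parallelize(Seq(
  (1L, Person("Alice", "Smith", 34)),
  (2L, Person("Bob", "Jones", 41)),
  (3L, Person("Carol", "Lee", 29))))

// Made-up sample edges: Edge(source ID, destination ID, Link).
val links: RDD[Edge[Link]] = sc.parallelize(Seq(
  Edge(1L, 2L, Link("friend", 5.0f)),
  Edge(2L, 3L, Link("colleague", 2.5f))))

val inputGraph: Graph[Person, Link] = Graph(people, links)
```

With this sample data, inputGraph has three vertices and two edges, which is enough to try out each of the map operators.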
If we want, we can transform the vertex attributes so that they contain only each person's name using mapVertices:
val outputGraph: Graph[String, Link] = inputGraph.mapVertices((_, person) => person.first + " " + person.last)
The new outputGraph now has a vertex attribute of type String instead of Person. The links between the people remain unchanged.
Similarly, suppose we are interested only in the nature of relationships, not their duration. This time, we can use mapEdges to change the edge attribute as follows:
// mapFun receives an Edge[Link], so the Link itself is reached through attr
val outputGraph: Graph[Person, String] = inputGraph.mapEdges(link => link.attr.relationship)
Finally, suppose we want to keep track of how old each person was when they first met and add this information to the edge attribute. We can do that by using mapTriplets:
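val outputGraph: Graph[Person, (Int, Int)] = inputGraph.mapTriplets { t =>
  // duration is a Float (assumed here to be in years); truncate it so the result matches (Int, Int)
  val years = t.attr.duration.toInt
  (t.srcAttr.age - years, t.dstAttr.age - years)
}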
If we want to change both the edge and vertex attributes of a graph, we can simply chain mapEdges or mapTriplets with mapVertices since each of these methods always returns a property graph.
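Such a chain might look like the following sketch. It redefines the Person and Link types and builds a tiny graph with made-up sample data so that it can run on its own; the names namesAndKinds and summary are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

case class Person(first: String, last: String, age: Int)
case class Link(relationship: String, duration: Float)

// Reuses the running context in spark-shell; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("chained-map-operators"))

// Made-up sample data for illustration.
val people: RDD[(VertexId, Person)] = sc.parallelize(Seq(
  (1L, Person("Alice", "Smith", 34)),
  (2L, Person("Bob", "Jones", 41))))
val links: RDD[Edge[Link]] = sc.parallelize(Seq(
  Edge(1L, 2L, Link("friend", 5.0f))))
val inputGraph: Graph[Person, Link] = Graph(people, links)

// Chain the operators: first simplify the vertices, then the edges.
// Each call returns a new property graph, so the calls compose directly.
val namesAndKinds: Graph[String, String] = inputGraph
  .mapVertices((_, p) => s"${p.first} ${p.last}")
  .mapEdges(e => e.attr.relationship)

// Extract plain tuples before collecting to inspect the result.
val summary = namesAndKinds.triplets
  .map(t => (t.srcAttr, t.attr, t.dstAttr))
  .collect()
  .toList
```

The order of the two calls does not matter here, because neither mapping function depends on the attributes changed by the other.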