Introducing GraphFrames

While the GraphX framework is based on the RDD API, GraphFrames is an external Spark package built on top of the DataFrames API. It inherits the performance advantages of DataFrames using the catalyst optimizer. It can be used in the Java, Scala, and Python programming languages. GraphFrames provides additional functionalities over GraphX such as motif finding, DataFrame-based serialization, and graph queries. GraphX does not provide the Python API, but GraphFrames exposes the Python API as well.

It is easy to get started with GraphFrames. On a Spark 2.0 cluster, let's start a Spark shell with the packages option using the same data used to create the graph in the Creating a graph section of this chapter:

$SPARK_HOME/bin/spark-shell --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11

import org.graphframes._

val vertex = spark.createDataFrame(List(
    ("1","Jacob",48),
    ("2","Jessica",45),
    ("3","Andrew",25),
    ("4","Ryan",53),
    ("5","Emily",22),
    ("6","Lily",52)
)).toDF("id", "name", "age")

val edges = spark.createDataFrame(List(
    ("6","1","Sister"),
    ("1","2","Husband"),
    ("2","1","Wife"),
    ("5","1","Daughter"),
    ("5","2","Daughter"),
    ("3","1","Son"),
    ("3","2","Son"),
    ("4","1","Friend"),
    ("1","5","Father"),
    ("1","3","Father"),
    ("2","5","Mother"),
    ("2","3","Mother")
)).toDF("src", "dst", "relationship")

val graph = GraphFrame(vertex, edges)

Once a graph is created, all graph operations and algorithms can be executed. Some of the basic operations are shown as follows:

scala> graph.vertices.show()
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Jacob| 48|
|  2|Jessica| 45|
|  3| Andrew| 25|
|  4|   Ryan| 53|
|  5|  Emily| 22|
|  6|   Lily| 52|
+---+-------+---+

scala> graph.edges.show()
+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  6|  1|      Sister|
|  1|  2|     Husband|
|  2|  1|        Wife|
|  5|  1|    Daughter|
|  5|  2|    Daughter|
|  3|  1|         Son|
|  3|  2|         Son|
|  4|  1|      Friend|
|  1|  5|      Father|
|  1|  3|      Father|
|  2|  5|      Mother|
|  2|  3|      Mother|
+---+---+------------+

scala> graph.vertices.groupBy().min("age").show()
+--------+
|min(age)|
+--------+
|      22|
+--------+

Motif finding

The motif finding algorithm is used to search for structural patterns in a graph. GraphFrame-based motif finding uses DataFrame-based DSL for finding structural patterns. The following example, graph.find("(a)-[e]->(b); (b)-[e2]->(a)"), will search for pairs of vertices a, and b, connected by edges in both directions. It will return a DataFrame of all such structures in the graph with columns for each of the named elements (vertices or edges) in the motif. In this case, the returned columns will be a, b, e, and e2:

scala> val motifs = graph.find("(a)-[e]->(b); (b)-[e2]->(a)")
scala> motifs.show()

+--------------+--------------+--------------+--------------+
|             e|             a|             b|            e2|
+--------------+--------------+--------------+--------------+
| [1,2,Husband]|  [1,Jacob,48]|[2,Jessica,45]|    [2,1,Wife]|
|    [2,1,Wife]|[2,Jessica,45]|  [1,Jacob,48]| [1,2,Husband]|
|[5,1,Daughter]|  [5,Emily,22]|  [1,Jacob,48]|  [1,5,Father]|
|[5,2,Daughter]|  [5,Emily,22]|[2,Jessica,45]|  [2,5,Mother]|
|     [3,1,Son]| [3,Andrew,25]|  [1,Jacob,48]|  [1,3,Father]|
|     [3,2,Son]| [3,Andrew,25]|[2,Jessica,45]|  [2,3,Mother]|
|  [1,5,Father]|  [1,Jacob,48]|  [5,Emily,22]|[5,1,Daughter]|
|  [1,3,Father]|  [1,Jacob,48]| [3,Andrew,25]|     [3,1,Son]|
|  [2,5,Mother]|[2,Jessica,45]|  [5,Emily,22]|[5,2,Daughter]|
|  [2,3,Mother]|[2,Jessica,45]| [3,Andrew,25]|     [3,2,Son]|
+--------------+--------------+--------------+--------------+

Now, let's filter the results as follows:

scala> motifs.filter("b.age > 30").show()
+--------------+--------------+--------------+-------------+
|             e|             a|             b|           e2|
+--------------+--------------+--------------+-------------+
| [1,2,Husband]|  [1,Jacob,48]|[2,Jessica,45]|   [2,1,Wife]|
|    [2,1,Wife]|[2,Jessica,45]|  [1,Jacob,48]|[1,2,Husband]|
|[5,1,Daughter]|  [5,Emily,22]|  [1,Jacob,48]| [1,5,Father]|
|[5,2,Daughter]|  [5,Emily,22]|[2,Jessica,45]| [2,5,Mother]|
|     [3,1,Son]| [3,Andrew,25]|  [1,Jacob,48]| [1,3,Father]|
|     [3,2,Son]| [3,Andrew,25]|[2,Jessica,45]| [2,3,Mother]|
+--------------+--------------+--------------+-------------+

You can cache the returned DataFrame if you are doing multiple operations. You can convert the GraphFrame to GraphX using toGraphX and also you can convert a GraphX graph to GraphFrame using fromGraphX.

Loading and saving GraphFrames

Since GraphFrames are built on top of DataFrames, they inherit all DataFrame-supported DataSources. You can write GraphFrames to the Parquet, JSON, and CSV formats. The following example shows how to write a GraphFrame to Parquet and read the same Parquet files to build the graph again:

scala> graph.vertices.write.parquet("vertices")
scala> graph.edges.write.parquet("edges")

scala> val verticesDF = spark.read.parquet("vertices")
scala> val edgesDF = spark.read.parquet("edges")
scala> val sameGraph = GraphFrame(verticesDF, edgesDF)
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset