The graph visualization

Spark and GraphX do not provide any built-in functionality for data visualization, since their focus is on data processing. However, a picture is worth a thousand numbers when it comes to data analysis. In the following sections, we will build a Spark application for visualizing and analyzing the connectedness of graphs. We will rely on a third-party library called GraphStream for drawing networks, and on BreezeViz for plotting structural properties of graphs, such as the degree distribution. These libraries are not perfect and have limitations, but they are relatively stable and simple to use, so we will use them to explore the graph examples in this chapter.

Note

Currently, there is still a lack of graph visualization engines and libraries that can draw large-scale networks without requiring a huge amount of computing resources. For example, the popular network analysis software SNAP currently relies on the GraphViz engine to draw networks, but it can only draw small- to medium-sized networks. Gephi is another tool for interactive network visualization. Although it has nice features, such as a multilevel layout and a built-in 3D rendering engine, Gephi still has high CPU and memory requirements. For drawing standard plots, the new project Apache Zeppelin offers a web-based notebook for interactive data analysis and visualization. It also provides built-in integration with Spark. Visit the official website for more information.

Installing the GraphStream and BreezeViz libraries

Let's get going by installing the third-party libraries and their dependencies in the $SPARKHOME/lib folder. GraphStream is an awesome Java library that enables the visualization of dynamic networks, which can evolve with time. For our purpose, we only need to display static networks, so we only need to download two JAR files, gs-core-1.2.jar and gs-ui-1.2.jar, for the core and UI libraries. They can be downloaded from the following repositories:

Put these two JAR files in the lib folder, within the project root directory. Next, download the breeze_2.10-0.9.jar and breeze-viz_2.10-0.9.jar libraries from the following repositories:

Since BreezeViz is a Scala library that depends on another Java library called JFreeChart, you will also need to install jcommon-1.0.16.jar and jfreechart-1.0.13.jar. These JAR files can be found in the following repositories:

After you have downloaded all four of these JAR files, copy them into the lib folder within the project root directory. You are now ready to draw your first graph from the Spark shell.

Visualizing the graph data

Open the terminal, with the current directory set to $SPARKHOME. Launch the Spark shell. This time you will need to specify the third-party JAR files with the --jars option:

$ ./bin/spark-shell --jars \
lib/breeze-viz_2.10-0.9.jar,\
lib/breeze_2.10-0.9.jar,\
lib/gs-core-1.2.jar,\
lib/gs-ui-1.2.jar,\
lib/jcommon-1.0.16.jar,\
lib/jfreechart-1.0.13.jar

Alternatively, you can save yourself some typing with the shorter command:

$ ./bin/spark-shell --jars \
$(find "." -name '*.jar' | xargs echo | tr ' ' ',')

Instead of specifying each JAR one at a time, the preceding command passes every JAR file found under the current directory.

As a first example, we will visualize the social ego network that we saw in the previous chapter. First, we need to import the GraphStream classes as follows:

scala> import org.graphstream.graph.{Graph => GraphStream}
import org.graphstream.graph.{Graph=>GraphStream}
scala> import org.graphstream.graph.implementations._ 
import org.graphstream.graph.implementations._

It is important that we rename org.graphstream.graph.Graph to GraphStream, to avoid a namespace collision with the Graph class of GraphX.
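The rename syntax is ordinary Scala and applies to any import. As a library-free sketch of the same idea, the following uses two standard collection classes that share the name Map; the alias MutableMap is our own choice for illustration:

```scala
// Rename the mutable Map on import so it does not shadow the default immutable Map
import scala.collection.mutable.{Map => MutableMap}

// The unqualified name Map still refers to the immutable Map from Predef
val fixed: Map[String, Int] = Map("a" -> 1)

// The alias refers to scala.collection.mutable.Map
val counts: MutableMap[String, Int] = MutableMap("a" -> 1)
counts("a") += 1

println(fixed("a"))
println(counts("a"))
```

After the renamed GraphStream import, Graph and GraphStream can be used side by side in the same way.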

Next, load the social ego network data using Graph.fromEdges, as we did in the previous chapter. After that, we will create a SingleGraph object:

// Create a SingleGraph object for GraphStream visualization
val graph: SingleGraph = new SingleGraph("EgoSocial") 

The SingleGraph object is a GraphStream abstraction that enables the manipulation and visualization of graph data. Concretely, we can invoke the addNode and addEdge methods of the SingleGraph object to add the network nodes and links. We can also invoke the addAttribute method on either the graph, or each individual edge and node to set their visual attributes. What's cool about the GraphStream API is that it cleanly separates the graph structure and visualization using a CSS-like style sheet to control the way the graph elements are displayed. It is much easier to see this in action. So, let's create a file named stylesheet and put it in a new ./style/ folder. Insert the following lines in the style sheet:

node { 
    fill-color: #a1d99b; 
    size: 20px; 
    text-size: 12; 
    text-alignment: at-right; 
    text-padding: 2; 
    text-background-color: #fff7bc; 
} 
edge { 
    shape: cubic-curve; 
    fill-color: #dd1c77; 
    z-index: 0; 
    text-background-mode: rounded-box; 
    text-background-color: #fff7bc; 
    text-alignment: above; 
    text-padding: 2; 
}

The preceding style sheet describes the visual styles of the graph elements using the node and edge selectors, and specifies their visual attributes as key-value pairs. In this example, we set the colors and shapes of the nodes, edges, and their text attributes. An exhaustive reference for the style sheet attributes used in GraphStream is available at http://graphstream-project.org/doc/Tutorials/Graph-Visualisation_1.1/.
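If you prefer to create the file without leaving the Spark shell, one possible approach is to write it with the Java NIO classes. This is only a sketch: the style sheet text is an abbreviated copy of the one above, and the relative path style/stylesheet is resolved against the directory from which the shell was launched.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Abbreviated copy of the style sheet shown above
val css =
  """node {
    |    fill-color: #a1d99b;
    |    size: 20px;
    |}
    |edge {
    |    shape: cubic-curve;
    |    fill-color: #dd1c77;
    |}
    |""".stripMargin

// Create the ./style/ folder if it does not exist yet
val styleDir = Paths.get("style")
Files.createDirectories(styleDir)

// Write (or overwrite) the file named stylesheet inside it
val stylesheetFile = styleDir.resolve("stylesheet")
Files.write(stylesheetFile, css.getBytes(StandardCharsets.UTF_8))
```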

With the style sheet now ready, we can connect it to the SingleGraph object graph:

// Set up the visual attributes for graph visualization 
graph.addAttribute("ui.stylesheet","url(file:.//style/stylesheet)") 
graph.addAttribute("ui.quality") 
graph.addAttribute("ui.antialias") 

In the last two lines, we simply told the rendering engine to favor quality over speed. Next, we have to reload the graph that we built in the previous chapter. To avoid repetition, we omit the graph building part. After this, we load the VertexRDD and EdgeRDD of the social network into the GraphStream graph object, with the following code:

// Given the egoNetwork, load the GraphX vertices into GraphStream
for ((id, _) <- egoNetwork.vertices.collect()) {
    val node = graph.addNode(id.toString).asInstanceOf[SingleNode]
}
// Load the GraphX edges into GraphStream edges
for (Edge(x, y, _) <- egoNetwork.edges.collect()) {
    val edge = graph.addEdge(x.toString ++ y.toString, x.toString, y.toString, true).
        asInstanceOf[AbstractEdge]
}

To add a node, we simply pass its vertex ID as a string argument. For the edges, we need to pass four arguments to the addEdge method. The first one is a string identifier for each edge. Since this identifier is not available in the original dataset or in the GraphX graph, we had to create one. Well, here the simplest solution was to concatenate the vertex IDs of the nodes that each edge links to.
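One caveat with plain concatenation is that two different edges can end up with the same identifier, for example the edges (1, 23) and (12, 3). A minimal sketch of a workaround, assuming a hypothetical helper named edgeId that is not part of GraphX or GraphStream, is to put a separator between the two vertex IDs:

```scala
// Hypothetical helper (not part of GraphX or GraphStream):
// build an edge identifier with a separator between the vertex IDs
def edgeId(src: Long, dst: Long): String = s"$src->$dst"

// Plain concatenation maps two different edges to the same string "123"
val plainA = 1L.toString ++ 23L.toString
val plainB = 12L.toString ++ 3L.toString
println(plainA == plainB)

// With a separator, the two identifiers stay distinct
println(edgeId(1L, 23L) == edgeId(12L, 3L))
```

This still assumes that there is at most one edge per ordered pair of vertices; parallel edges would need an extra component in the identifier, such as a counter.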

Tip

In the preceding code, we had to use a subtle trick to avoid an interoperability issue between our Scala code and the GraphStream Java library. As described in the org.graphstream.graph.implementations.AbstractGraph API of GraphStream, the addNode and addEdge methods return the node and edge respectively. However, because these are generic methods of a Java library, we had to force the return types of addNode and addEdge using the asInstanceOf[T] method, with the T type being SingleNode and AbstractEdge, respectively. So what would have happened if we had omitted these explicit type conversions? We would have got a rather strange exception saying:

java.lang.ClassCastException: org.graphstream.graph.implementations.SingleNode cannot be cast to scala.runtime.Nothing$
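Another way to pin down the return type of a generic method is to give the type parameter explicitly at the call site. The following sketch runs without GraphStream: FakeNode and FakeGraph are hypothetical stand-ins that only mimic the shape of a generic addNode method, so treat it as an illustration of the Scala mechanism rather than of the GraphStream API itself.

```scala
// Hypothetical stand-ins; these are not GraphStream classes
class FakeNode(val id: String)

class FakeGraph {
  // Same shape as a Java generic method declared as
  // <T extends Node> T addNode(String id)
  def addNode[T <: FakeNode](id: String): T =
    new FakeNode(id).asInstanceOf[T]
}

val g = new FakeGraph

// Passing the type parameter explicitly tells the compiler what T is,
// so it does not fall back to inferring Nothing
val n1 = g.addNode[FakeNode]("1")
println(n1.id)
```

Assuming the GraphStream methods are declared with the same generic shape, the equivalent call would pass SingleNode or AbstractEdge as the type parameter.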

Now what? The only thing left to do is to display the social ego network. Just call the display method on graph:

graph.display()

Voila! You will now see the network drawn in a new window, as shown in the following:

[Figure: Visualizing the graph data]

Tip

If your graph is not displayed with the colors above, you should check that the stylesheet's file path is correct when setting the graph's attribute called ui.stylesheet.
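A quick way to check the path from the Spark shell is to resolve it yourself and test whether the file exists. The helper below, describeStylesheet, is a hypothetical name introduced for this sketch; it relies only on the Java NIO classes. A relative path is resolved against the directory from which the shell was launched.

```scala
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper: report where a style sheet path resolves to
// and whether a regular file exists at that location
def describeStylesheet(location: String): String = {
  val resolved: Path = Paths.get(location).toAbsolutePath.normalize
  if (Files.isRegularFile(resolved)) s"found: $resolved"
  else s"missing: $resolved"
}

println(describeStylesheet("style/stylesheet"))
```

If the message starts with missing, either move the file or launch the shell from the directory that contains the style folder.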

Plotting the degree distribution

As shown by this visualization, each person in the ego network seems to be either isolated or connected to a large group of mutual friends. We can analyze this further by plotting the degree distribution of the network. Doing this from the Spark shell is as easy as before. Make sure that you first import some classes from JFreeChart and Breeze:

import org.jfree.chart.axis.ValueAxis 
import breeze.linalg._ 
import breeze.plot._ 

We will then employ the degreeHistogram function that we built in Chapter 2, Building and Exploring Graphs. For convenience, its definition is shown as follows:

def degreeHistogram(net: Graph[Int, Int]): Array[(Int, Int)] =  
    net.degrees.map(t => (t._2,t._1)). 
          groupByKey.map(t => (t._1,t._2.size)). 
          sortBy(_._1).collect()

From the degree histogram, we can obtain the degree distribution, which is the probability distribution of the node degrees over the whole network. For this, we just divide the number of nodes that have each degree by the total number of nodes, so that the degree probabilities add up to one:

val nn = egoNetwork.numVertices 
val egoDegreeDistribution = degreeHistogram(egoNetwork).map({case (d,n) => (d,n.toDouble/nn)})
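To see the normalization on a small scale, here is the same computation on a plain Scala collection, without Spark. The degree values are made up for illustration, and the names degrees, histogram, and distribution are our own:

```scala
// Toy degree list, one entry per node (made-up values)
val degrees = Seq(1, 1, 2, 3, 3, 3)

// Histogram: (degree, number of nodes with that degree), sorted by degree
val histogram: Seq[(Int, Int)] =
  degrees.groupBy(identity)
    .map { case (d, ds) => (d, ds.size) }
    .toSeq
    .sortBy(_._1)

// Distribution: divide each count by the total number of nodes
val n = degrees.size
val distribution: Seq[(Int, Double)] =
  histogram.map { case (d, count) => (d, count.toDouble / n) }

// The probabilities add up to one, up to floating-point rounding
println(distribution.map(_._2).sum)
```

Here two of the six nodes have degree 1, one has degree 2, and three have degree 3, so the probabilities are 2/6, 1/6, and 3/6.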

To display the degree distribution, we first create a Figure object called f and two plot objects called p1 and p2. In the following code, p1 = f.subplot(2,1,0) and p2 = f.subplot(2,1,1) specify that f will have two subplots, and that p1 is displayed above p2. Indeed, the first two arguments of the subplot are the number of rows and columns of the figure, whereas the third argument denotes the subplot index, which starts at 0:

val f = Figure()
val p1 = f.subplot(2,1,0) 
val x = new DenseVector(egoDegreeDistribution map (_._1.toDouble)) 
val y = new DenseVector(egoDegreeDistribution map (_._2))
p1.xlabel = "Degrees" 
p1.ylabel = "Distribution" 
p1 += plot(x, y) 
p1.title = "Degree distribution of social ego network"
val p2 = f.subplot(2,1,1)
val egoDegrees = egoNetwork.degrees.map(_._2).collect()

// Label the second subplot (p2), not the first one
p2.xlabel = "Degrees"
p2.ylabel = "Histogram of node degrees"
p2 += hist(egoDegrees, 10)

This code will then display the degree distribution and degree frequencies of the ego network:

[Figure: Plotting the degree distribution]