Chapter 7. Learning Graph Structures

In this chapter, we will show you how to learn interesting structures from graphs in Spark. In principle, we learn and find relationships in data by first selecting the problem of interest. The most common learning problems are regression, classification, ranking, and clustering. In this book, we will focus on clustering. In particular, we will focus on graph data, and apply clustering to detect communities within graphs. Here is our roadmap for this chapter. First, we will introduce the concepts of spectral clustering. Then, we will study a specific method that allows us to cluster graphs in Spark. Finally, we will apply these techniques to music and song playlist datasets. This application will also serve as an opportunity to review, and bring together, the tools and techniques that we covered in the previous chapters.

Community clustering in graphs

Clustering is a learning problem in which the given entities, such as objects or people, are partitioned into subsets according to a defined similarity measure. Entities within the same cluster are very similar to each other, and differ from the entities in all other clusters. Clustering is an unsupervised learning method; in other words, it operates on unlabeled data, which consists of the attributes or features of the entities. Moreover, clustering methods can be broadly classified into parametric and nonparametric approaches. Parametric approaches impose a probability model on the data; examples include the Gaussian Mixture Model (GMM) and Latent Dirichlet Allocation (LDA). Nonparametric models, on the other hand, infer the structure of the clusters from the data itself; examples include k-means and spectral clustering. All of these methods are available in Spark's MLlib library.
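For instance, a minimal sketch of k-means in MLlib might look like the following; the SparkContext sc, the toy data points, and the choice of two clusters are our own assumptions for illustration:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Toy two-dimensional points (assumed for illustration); sc is an existing SparkContext
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
)).cache()

// Train a k-means model with 2 clusters and at most 20 iterations
val model = KMeans.train(points, 2, 20)

// Print the learned cluster centers
model.clusterCenters.foreach(println)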

Before we continue, it is important to understand why clustering is related to graph processing. There are two reasons for this. The first reason is that clustering is very useful for detecting "communities" in graphs. These communities are essentially clusters of nodes that share similar features. Even when two nodes are not explicitly connected, clustering can reveal their similarity by learning from their attribute data. For instance, online social and dating websites use such information to suggest people you may know, or partners you may be interested in meeting. Clustering can also help uncover interesting structures in highly connected networks. The second reason is that the clustering method we will see here is itself based on graph processing. In particular, we will focus our attention on power iteration clustering (PIC), which is a simple and fast spectral clustering method.

Spectral clustering

As mentioned previously, the aim of clustering is to divide the data points into several clusters in such a way that the points in the same cluster are very similar, and the points in different clusters are dissimilar to each other. A "similarity graph" is a nice way to represent the similarities of the data points: each point becomes a node in the similarity graph, and each edge carries, as its attribute, the "similarity measure" of the nodes it connects. As a result, the clustering problem reduces to finding a partition of the graph such that the edges between different groups have very low weights, and the edges within a group have high weights. To do this, we use the spectral clustering technique, which basically reduces the high-dimensional similarity graph to a low-dimensional representation. To keep our discussion really simple, we will avoid the math. However, the technical details can be learned by reading a good tutorial, such as the one available at http://arxiv.org/abs/0711.0189.

Power iteration clustering

An efficient and scalable spectral clustering method is power iteration clustering (PIC). It is available in the MLlib library, and documented at http://spark.apache.org/docs/latest/mllib-clustering.html#power-iteration-clustering-pic. It is implemented in Spark using GraphX's processing APIs and caching optimizations. Here is the API of the PowerIterationClustering class:

class PowerIterationClustering {

    // Run the PIC algorithm.
    def run(similarities: RDD[(Long, Long, Double)]): PowerIterationClusteringModel
    
    // Set the initialization mode. Either "random" or "degree"
    def setInitializationMode(mode: String): PowerIterationClustering.this.type

    // Set the number of clusters.
    def setK(k: Int): PowerIterationClustering.this.type

    // Set maximum number of iterations of the power iteration loop
    def setMaxIterations(maxIterations: Int): PowerIterationClustering.this.type
}

To apply the PIC clustering to graphs, we will need to follow these five steps:

  1. First, load the data into a Spark property graph.
  2. Second, extract the features of the nodes.
  3. Third, define a similarity measure between two nodes.
  4. Next, create an affinity matrix based on the initial graph, using the similarity measure.
  5. Finally, run k-means clustering on the affinity matrix.

Steps 1 and 2 can be done using the graph builder methods that we learned in Chapter 2, Building and Exploring Graphs. Step 3 simply requires us to define a function that determines how similar two nodes are. The choice of similarity measure depends on the nodes' features, and the problem at hand. Nonetheless, there exist standard measures from which we can choose. For instance, if the node feature is a binary vector, we can use the Jaccard similarity. On the other hand, a Gaussian kernel function can be used when the node feature is a real-valued vector. These are not the only possibilities, and we can also define our own measure.
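To make these two measures concrete, here is a minimal sketch of both in Scala; the helper names jaccardSimilarity and gaussianSimilarity are our own, not part of MLlib:

// Jaccard similarity between two binary vectors, represented here as
// sets of the indices that are set to 1: |A ∩ B| / |A ∪ B|
def jaccardSimilarity(a: Set[Int], b: Set[Int]): Double =
  if (a.isEmpty && b.isEmpty) 0.0  // convention: two empty sets have no similarity
  else (a intersect b).size.toDouble / (a union b).size.toDouble

// Gaussian kernel between two real-valued feature vectors:
// exp(-||x - y||^2 / (2 * sigma^2)), where sigma is a bandwidth parameter
def gaussianSimilarity(x: Array[Double], y: Array[Double], sigma: Double): Double = {
  val squaredDistance = x.zip(y).map { case (xi, yi) => (xi - yi) * (xi - yi) }.sum
  math.exp(-squaredDistance / (2.0 * sigma * sigma))
}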

In Step 4, the affinity matrix of similarities should be represented by an RDD of (i, j, sim) tuples. The similarity sim must be a nonnegative number. For any edge (i, j) with a nonzero similarity, there should be either (i, j, sim) or (j, i, sim) in the input. Since the affinity matrix must be symmetric, if only (i, j, sim) is present in the data, the reciprocal (j, i, sim) is assumed, and vice versa. Moreover, tuples with i = j are simply ignored. One way to build this RDD from a property graph is sketched below.
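This illustration assumes the gaussianSimilarity helper sketched previously, and a hypothetical graph whose vertex attributes are real-valued feature vectors:

import org.apache.spark.graphx.Graph
import org.apache.spark.rdd.RDD

// Map each edge triplet to an (i, j, sim) tuple, computing the similarity
// from the feature vectors stored on the source and destination vertices
def affinities(graph: Graph[Array[Double], Int],
               sigma: Double): RDD[(Long, Long, Double)] =
  graph.triplets
    .filter(t => t.srcId != t.dstId)  // self-loops (i = j) are ignored by PIC anyway
    .map(t => (t.srcId, t.dstId, gaussianSimilarity(t.srcAttr, t.dstAttr, sigma)))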

The last step itself consists of two parts: the run method first performs the power iterations on the affinity matrix, and then runs a k-means clustering on the result to produce a PowerIterationClusteringModel. Before running the clustering model, we must also set two parameters:

  • The maximum number of iterations of the power iteration loop
  • The number of clusters, k

A sketch of the application of PIC is shown in the following code:

import org.apache.spark.mllib.clustering.PowerIterationClustering
import org.apache.spark.mllib.clustering.PowerIterationClustering.Assignment
import org.apache.spark.rdd.RDD

// Define pairwise similarities based on the initial graph
val similarities: RDD[(Long, Long, Double)] = ...

// Create the PIC clustering model
val pic = new PowerIterationClustering()
  .setK(numClusters)
  .setMaxIterations(maxIterations)

// Run the PIC clustering model
val clusteringResult: RDD[Assignment] = pic.run(similarities).assignments

clusteringResult.collect().foreach { a =>
  println(s"${a.id} -> ${a.cluster}")
}

The PIC method returns an RDD of Assignment objects, each of which wraps a node ID (a Long, that is, a VertexId) and an Int cluster index that indicates the cluster group the node belongs to.
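For example, we can group the node IDs by their cluster index to inspect the discovered communities; this usage sketch assumes the clusteringResult RDD from the previous code, and a result small enough to collect on the driver:

// Group node IDs by their assigned cluster
val membersByCluster = clusteringResult
  .map(a => (a.cluster, a.id))
  .groupByKey()

// Print each cluster with its member nodes (only safe for small results)
membersByCluster.collect().foreach { case (cluster, ids) =>
  println(s"cluster $cluster: ${ids.mkString(", ")}")
}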
