Agglomerative clustering

In contrast to algorithms such as k-means, which partition the dataset into a fixed number of groups, agglomerative or hierarchical clustering techniques start by considering each datapoint as its own cluster and merge clusters into larger groups from the bottom up (Maimon, Oded, and Lior Rokach, eds. Data Mining and Knowledge Discovery Handbook. Vol. 2. New York: Springer, 2005). The classical application of this idea is the phylogenetic tree in evolution, where common ancestors connect individual organisms. Indeed, these methods organize the data into tree diagrams, known as dendrograms, which visualize how the data is sequentially merged into larger groups.

The basic steps of an agglomerative algorithm are (diagrammed visually in the figure below):

  1. Start with each point in its own cluster.
  2. Compare each pair of datapoints using a distance metric. This could be any of the methods discussed above.
  3. Use a linkage criterion to merge datapoints (at the first stage) or clusters (in subsequent phases), where the linkage is given by a function such as:
    • The maximum distance between two sets of points, also known as complete linkage.
    • The minimum distance between two sets of points, also known as single linkage.
    • The average distance between two sets of points, also known as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA). The points in each group could also be weighted to give a weighted average, or WUPGMA.
    • The distance between centroids (centers of mass), or UPGMC.
    • The increase in within-cluster squared Euclidean distance (variance) caused by merging two sets of points, or Ward's method (Ward Jr, Joe H. "Hierarchical grouping to optimize an objective function." Journal of the American Statistical Association 58.301 (1963): 236-244).
  4. Repeat steps 2-3 until there is only a single cluster containing all data points. Note that after the first round, the clusters formed in earlier stages become new points to compare with all other clusters, and the clusters grow larger as the algorithm proceeds. Along the way, we construct a tree diagram that records how the clusters from prior steps were sequentially merged together. A minimal sketch of these steps, using SciPy, follows the figure below.
Agglomerative clustering: From top to bottom, example of tree construction (right) from a dataset (left) by sequentially merging closest points
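As a rough illustration of these steps, the following sketch uses SciPy's hierarchical clustering routines on a small, made-up array of points (the array toy_points and its values are purely illustrative and are not part of this chapter's dataset):

>>> import numpy as np
>>> from scipy.cluster.hierarchy import linkage, dendrogram
>>> # A handful of two-dimensional points; each begins as its own cluster (step 1)
>>> toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [1.2, 0.9], [5.0, 5.0]])
>>> # Steps 2-3: compute pairwise Euclidean distances and merge using Ward linkage;
>>> # each row of Z records one merge (cluster a, cluster b, linkage distance, new cluster size)
>>> Z = linkage(toy_points, method='ward', metric='euclidean')
>>> # Step 4: the successive merges form a tree, which dendrogram() draws
>>> dendrogram(Z)

The linkage distances recorded in Z are what we will use to cut the tree at a chosen depth in the next paragraph.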

Note that we could also run this process in reverse, starting from the full dataset and repeatedly splitting it until we reach individual points, which would likewise construct a tree diagram. In either case, we can then find clusters at several levels of resolution by choosing a cutoff depth in the tree and assigning each point to the largest cluster it belongs to below that cutoff. This depth is usually expressed in terms of the linkage score from step 3, which conceptually lets us choose how close (or far apart, as we move up the tree) two groups may be before we stop considering them a single cluster.
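As a sketch of this cutting step, SciPy's fcluster function assigns flat cluster labels by cutting the tree at a chosen linkage distance. The threshold of 2.0 below is an arbitrary value chosen purely for illustration, and the toy data repeats the earlier sketch so that the snippet runs on its own:

>>> import numpy as np
>>> from scipy.cluster.hierarchy import linkage, fcluster
>>> toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [1.2, 0.9], [5.0, 5.0]])
>>> Z = linkage(toy_points, method='ward', metric='euclidean')
>>> # Keep together every merge whose linkage distance falls below 2.0,
>>> # yielding one flat cluster label per point
>>> labels = fcluster(Z, t=2.0, criterion='distance')

Raising the threshold merges more groups together (ultimately into a single cluster), while lowering it splits the data into more, smaller clusters.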

Where agglomerative clustering fails

Agglomerative algorithms have many of the same ingredients as k-means: we choose a number of clusters (which determines where we cut the tree generated by clustering; in the most extreme case, all points become members of a single cluster) and a similarity metric. We also need to choose a linkage criterion for step 3, which determines the rules for merging individual branches of our tree. Can agglomerative clustering succeed where k-means failed? Trying this approach on the circular data suggests otherwise, as shown by plotting the results of the following commands:

>>> import numpy as np
>>> from sklearn.cluster import AgglomerativeClustering
>>> # Fit on the coordinate columns of df and return a cluster label per point
>>> agglomerative_clusters = AgglomerativeClustering(n_clusters=2, linkage='ward').fit_predict(X=np.array(df)[:,1:])
>>> df.plot(kind='scatter', x='x_coord', y='y_coord', c=agglomerative_clusters)
Where agglomerative clustering fails: Ward linkage does not separate the two circles

In order to correctly group the inner and outer circles, we can attempt to modify our notion of similarity using connectivity, a concept taken from graph analysis, in which a set of nodes is connected by edges and connectivity refers to whether two nodes share an edge between them. Here, we essentially construct a graph between pairs of points by limiting the number of points that may be considered similar to a given point (its nearest neighbors), rather than measuring a continuous distance metric between every pair of points. This potentially reduces our difficulties with the concentric circles data: if we use a very small neighborhood (say, the 10 nearest points), the roughly uniform distance from the middle of the data to the outer ring is no longer problematic, because the central points will always be closer to each other than to the periphery.

To construct this connectivity-based similarity, we could take a distance matrix such as those we have already calculated and threshold it at some value below which we consider two points connected, giving us a binary matrix of 0s and 1s. A matrix of this kind, representing the presence or absence of an edge between nodes in a graph, is also known as an adjacency matrix. We could choose the threshold by inspecting the distribution of pairwise similarity scores, or based on prior knowledge. We can then supply this matrix as an argument to our agglomerative clustering routine, providing a neighborhood of points to consider when comparing datapoints, which gives an initial structure to the clusters.

We can see that this makes a huge difference to the algorithm's results after running the commands that follow the sketch below. Note that when we generate the adjacency matrix L, we may end up with an asymmetric matrix, since we keep the ten most similar points for each member of the data: two points need not be mutually among each other's nearest neighbors, so an edge may be represented in only the upper or lower triangle of the adjacency matrix. To generate a symmetric input for our clustering algorithm, we take the average of the matrix L and its transpose, which effectively adds edges in both directions between two points.
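Before those commands, here is a minimal illustration of the thresholding idea itself, using a small, made-up distance matrix and an arbitrary cutoff of 1.5 (both are hypothetical and chosen only for demonstration):

>>> import numpy as np
>>> # A hypothetical 4 x 4 matrix of pairwise distances between four points
>>> distances = np.array([[0.0, 1.0, 2.5, 3.0],
...                       [1.0, 0.0, 2.2, 2.8],
...                       [2.5, 2.2, 0.0, 0.9],
...                       [3.0, 2.8, 0.9, 0.0]])
>>> # Points closer than the cutoff are connected (1), everything else is not (0);
>>> # the diagonal is zeroed so points are not connected to themselves
>>> adjacency = (distances < 1.5).astype(int)
>>> np.fill_diagonal(adjacency, 0)
>>> adjacency
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]])

With that picture in mind, the actual commands use scikit-learn's kneighbors_graph helper, which connects each point to a fixed number of nearest neighbors rather than applying a single global distance cutoff: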

>>> from sklearn.cluster import AgglomerativeClustering
>>> from sklearn.neighbors import kneighbors_graph
>>> # Build a sparse adjacency matrix connecting each point to its 10 nearest neighbors
>>> L = kneighbors_graph(np.array(df)[:,1:], n_neighbors=10, include_self=False)
>>> # Symmetrize by averaging L with its transpose, so every edge runs in both directions
>>> L = 0.5 * (L + L.T)
>>> agglomerative_clusters = AgglomerativeClustering(n_clusters=2, connectivity=L, linkage='average').fit_predict(X=np.array(df)[:,1:])
>>> df.plot(kind='scatter', x='x_coord', y='y_coord', c=agglomerative_clusters)

Now, as you can see in the resulting plot, the algorithm correctly identifies and separates the two clusters.

Interestingly, constructing this neighborhood graph and partitioning it into subgraphs (splitting the whole graph into sets of nodes and edges that are primarily connected to each other, rather than to other elements of the network) is equivalent to performing k-means clustering on a transformed distance matrix, an approach known as spectral clustering (Von Luxburg, Ulrike. "A tutorial on spectral clustering." Statistics and Computing 17.4 (2007): 395-416). The transformation here starts from the Euclidean distances we calculated earlier and computes a kernel score for each pair of points, using the Gaussian kernel given by:

$$K_{ij} = \exp\left(-\gamma \,\lVert x_i - x_j \rVert^2\right)$$

between each pair of points i and j, with a bandwidth parameter γ, rather than applying a hard threshold as we did when constructing the neighborhood graph earlier. Using the pairwise kernel matrix K calculated over all points i and j, we can then construct the Laplacian matrix of the graph, which is given by:

$$L = I - D^{-1}K$$

Here, I is the identity matrix (with ones along the diagonal and zeros elsewhere), and D is the diagonal matrix whose elements are:

$$D_{ii} = \sum_{j} K_{ij}$$

giving the total edge weight (in effect, a soft count of neighbors) for each point i. In essence, by calculating L we now represent the dataset as a series of nodes (points) connected by edges (the elements of the normalized kernel matrix $D^{-1}K$), which have been scaled so that the total weight of all edges for each node sums to 1. Since the Gaussian kernel score is continuous, this normalization turns the similarities between a given point and all others into a probability distribution whose values (the edge weights) sum to 1.
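As a concrete sketch of these formulas, the following builds K, D, and L for a handful of made-up points; the array toy_points and the bandwidth gamma=1.0 are purely illustrative:

>>> import numpy as np
>>> from sklearn.metrics.pairwise import rbf_kernel
>>> toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [1.2, 0.9]])
>>> # Pairwise Gaussian kernel scores K_ij = exp(-gamma * ||x_i - x_j||^2)
>>> K = rbf_kernel(toy_points, gamma=1.0)
>>> # D is diagonal, with D_ii equal to the total edge weight of point i
>>> D = np.diag(K.sum(axis=1))
>>> # Normalized graph Laplacian: L = I - D^{-1} K
>>> L = np.eye(len(toy_points)) - np.linalg.inv(D).dot(K)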

You may recall from linear algebra that the eigenvectors of a matrix A are the vectors v for which multiplying the matrix by v gives the same result as multiplying v by some constant amount λ:

$$Av = \lambda v$$

Thus, the matrix represents a kind of operation on the vector. For example, the identity matrix has an eigenvalue of 1, since multiplying v by the identity returns v itself. We could also have a matrix such as:

$$A = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$$

which doubles the value of any vector it multiplies, suggesting that the matrix acts as a 'stretching' operation on the vector.
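We can check this directly with NumPy; this small example is purely illustrative and separate from the chapter's dataset:

>>> import numpy as np
>>> A = np.array([[2.0, 0.0], [0.0, 2.0]])
>>> # Each column of vectors is an eigenvector v, paired with an eigenvalue in values,
>>> # satisfying A.dot(v) == value * v
>>> values, vectors = np.linalg.eig(A)
>>> values
array([2., 2.])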

From this perspective, larger eigenvalues correspond to greater stretching of the vector, while the eigenvector gives the direction in which the stretching occurs. This is useful because it gives us the primary axes along which the matrix operation acts. In our example, if we take the two eigenvectors with the largest eigenvalues (in essence, the directions in which the matrix represents the greatest transformation of a vector), we extract the two greatest axes of variation in the matrix. We will return to this concept in more detail when we discuss Principal components in Chapter 6, Words and Pixels – Working with Unstructured Data, but suffice it to say that if we run k-means on these eigenvectors (an approach known as spectral clustering, since the set of eigenvalues of a matrix is known as its spectrum), we get a result very similar to the previous agglomerative clustering approach using neighborhoods, as we can see from executing the following commands:

>>> from sklearn.cluster import SpectralClustering
>>> # By default, scikit-learn builds the affinity matrix using a Gaussian (RBF) kernel
>>> spectral_clusters = SpectralClustering(n_clusters=2).fit_predict(np.array(df)[:,1:])
>>> df.plot(kind='scatter', x='x_coord', y='y_coord', c=spectral_clusters)
Spectral clustering result: the inner and outer circles are correctly separated

We can successfully capture this nonlinear separation boundary because we have represented the points in the space of greatest variation in pairwise distance, which here is the difference between the inner and outer circles in the data.
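To connect the pieces, here is a rough, self-contained sketch of the same idea done by hand: build the kernel matrix K, normalize it by the degree matrix D, take the two leading eigenvectors, and run k-means on them. The generated two-circle data, the bandwidth gamma, and the variable names are illustrative assumptions rather than the chapter's own dataset, and the quality of the separation will depend on the bandwidth chosen:

>>> import numpy as np
>>> from sklearn.datasets import make_circles
>>> from sklearn.metrics.pairwise import rbf_kernel
>>> from sklearn.cluster import KMeans
>>> # Made-up two-circle data standing in for the chapter's df
>>> points, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
>>> K = rbf_kernel(points, gamma=10.0)      # pairwise Gaussian kernel scores
>>> D = np.diag(K.sum(axis=1))              # diagonal matrix of total edge weights
>>> M = np.linalg.inv(D).dot(K)             # row-normalized affinity, D^{-1} K
>>> # The leading eigenvectors of M (equivalently, the smallest of L = I - D^{-1} K)
>>> # span the axes of greatest variation; cluster the points in that space
>>> values, vectors = np.linalg.eig(M)
>>> top_two = vectors[:, np.argsort(values.real)[::-1][:2]].real
>>> manual_spectral = KMeans(n_clusters=2).fit_predict(top_two)

scikit-learn's SpectralClustering wraps essentially this pipeline, with some additional refinements, behind the single call we used above.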

The preceding examples should have given you a number of approaches you can use to tackle clustering problems; as a rule-of-thumb guide, the following diagram illustrates the decision process for choosing between them:

Rule-of-thumb decision process for choosing a clustering algorithm

For the last part of our exploration of clustering, let's look at an example application utilizing Spark Streaming and k-means, which will allow us to incrementally update our clusters as new data is received.
