Organizing clusters as a hierarchical tree

In this section, we will take a look at an alternative approach to prototype-based clustering: hierarchical clustering. One advantage of hierarchical clustering algorithms is that it allows us to plot dendrograms (visualizations of a binary hierarchical clustering), which can help with the interpretation of the results by creating meaningful taxonomies. Another useful advantage of this hierarchical approach is that we do not need to specify the number of clusters up front.

The two main approaches to hierarchical clustering are agglomerative and divisive hierarchical clustering. In divisive hierarchical clustering, we start with one cluster that encompasses all our samples, and we iteratively split the cluster into smaller clusters until each cluster only contains one sample. In this section, we will focus on agglomerative clustering, which takes the opposite approach. We start with each sample as an individual cluster and merge the closest pairs of clusters until only one cluster remains.

Grouping clusters in bottom-up fashion

The two standard algorithms for agglomerative hierarchical clustering are single linkage and complete linkage. Using single linkage, we compute the distances between the most similar members for each pair of clusters and merge the two clusters for which the distance between the most similar members is the smallest. The complete linkage approach is similar to single linkage but, instead of comparing the most similar members in each pair of clusters, we compare the most dissimilar members to perform the merge. This is shown in the following diagram:

Grouping clusters in bottom-up fashion

Note

Other commonly used algorithms for agglomerative hierarchical clustering include average linkage and Ward's linkage. In average linkage, we merge the cluster pairs based on the minimum average distances between all group members in the two clusters. In Ward's linkage, the two clusters that lead to the minimum increase of the total within-cluster SSE are merged.

In this section, we will focus on agglomerative clustering using the complete linkage approach. Hierarchical complete linkage clustering is an iterative procedure that can be summarized by the following steps:

  1. Compute the distance matrix of all samples.
  2. Represent each data point as a singleton cluster.
  3. Merge the two closest clusters based on the distance between the most dissimilar (distant) members.
  4. Update the similarity matrix.
  5. Repeat steps 2-4 until one single cluster remains.

Next, we will discuss how to compute the distance matrix (step 1). But first, let's generate some random sample data to work with: the rows represent different observations (IDs 0-4), and the columns are the different features (X, Y, Z) of those samples:

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(123)
>>> variables = ['X', 'Y', 'Z']
>>> labels = ['ID_0','ID_1','ID_2','ID_3','ID_4']
>>> X = np.random.random_sample([5,3])*10
>>> df = pd.DataFrame(X, columns=variables, index=labels)
>>> df

After executing the preceding code, we should now see the following data frame containing the randomly generated samples:

Grouping clusters in bottom-up fashion

Performing hierarchical clustering on a distance matrix

To calculate the distance matrix as input for the hierarchical clustering algorithm, we will use the pdist function from SciPy's spatial.distance submodule:

>>> from scipy.spatial.distance import pdist, squareform
>>> row_dist = pd.DataFrame(squareform(
...            pdist(df, metric='euclidean')),
...            columns=labels, index=labels)
>>> row_dist

Using the preceding code, we calculated the Euclidean distance between each pair of sample points in our dataset based on the features X, Y, and Z. We provided the condensed distance matrix—returned by pdist—as input to the squareform function to create a symmetrical matrix of the pair-wise distances as shown here:

Performing hierarchical clustering on a distance matrix

Next, we will apply the complete linkage agglomeration to our clusters using the linkage function from SciPy's cluster.hierarchy submodule, which returns a so-called linkage matrix .

However, before we call the linkage function, let us take a careful look at the function documentation:

>>> from scipy.cluster.hierarchy import linkage
>>> help(linkage)
[…]
Parameters:
  y : ndarray
    A condensed or redundant distance matrix. A condensed 
    distance matrix is a flat array containing the upper 
    triangular of the distance matrix. This is the form 
    that pdist returns. Alternatively, a collection of m 
    observation vectors in n dimensions may be passed as 
    an m by n array.

  method : str, optional
    The linkage algorithm to use. See the Linkage Methods 
    section below for full descriptions.

  metric : str, optional
    The distance metric to use. See the distance.pdist 
    function for a list of valid distance metrics.

 Returns:
  Z : ndarray
    The hierarchical clustering encoded as a linkage matrix.
[…]

Based on the function description, we conclude that we can use a condensed distance matrix (upper triangular) from the pdist function as an input attribute. Alternatively, we could also provide the initial data array and use the 'euclidean' metric as a function argument in linkage. However, we should not use the squareform distance matrix that we defined earlier, since it would yield different distance values than expected. To sum it up, the three possible scenarios are listed here:

  • Incorrect approach: Using the squareform distance matrix shown in the following code snippet would lead to incorrect results:
    >>> from scipy.cluster.hierarchy import linkage
    >>> row_clusters = linkage(row_dist,
    ...                        method='complete',
    ...                        metric='euclidean')
  • Correct approach: Using the condensed distance matrix as shown in the following code example yields the correct pairwise distance matrix:
    >>> row_clusters = linkage(pdist(df, metric='euclidean'),
    ...                        method='complete')
  • Correct approach: Using the complete input sample matrix as shown in the following code snippet also leads to a correct distance matrix similar to the preceding approach:
    >>> row_clusters = linkage(df.values, 
    ...                        method='complete', 
    ...                        metric='euclidean')

To take a closer look at the clustering results, we can turn clustering results into a pandas DataFrame (best viewed in a Jupyter Notebook) as follows:

>>> pd.DataFrame(row_clusters,
...      columns=['row label 1',
...               'row label 2',
...               'distance', 
...               'no. of items in clust.'],
...      index=['cluster %d' %(i+1) for i in
...             range(row_clusters.shape[0])])

As shown in the following screenshot, the linkage matrix consists of several rows where each row represents one merge. The first and second columns denote the most dissimilar members in each cluster, and the third column reports the distance between those members. The last column returns the count of the members in each cluster:

Performing hierarchical clustering on a distance matrix

Now that we have computed the linkage matrix, we can visualize the results in the form of a dendrogram:

>>> from scipy.cluster.hierarchy import dendrogram
# make dendrogram black (part 1/2)
# from scipy.cluster.hierarchy import set_link_color_palette
# set_link_color_palette(['black'])
>>> row_dendr = dendrogram(row_clusters, 
...                       labels=labels,
...                       # make dendrogram black (part 2/2)
...                       # color_threshold=np.inf
...                       )
>>> plt.tight_layout()
>>> plt.ylabel('Euclidean distance')
>>> plt.show()

If you are executing the preceding code or reading an ebook version of this book, you will notice that the branches in the resulting dendrogram are shown in different colors. The coloring scheme is derived from a list of Matplotlib colors that are cycled for the distance thresholds in the dendrogram. For example, to display the dendrograms in black, you can uncomment the respective sections that I inserted in the preceding code:

Performing hierarchical clustering on a distance matrix

Such a dendrogram summarizes the different clusters that were formed during the agglomerative hierarchical clustering; for example, we can see that the samples ID_0 and ID_4, followed by ID_1 and ID_2, are the most similar ones based on the Euclidean distance metric.

Attaching dendrograms to a heat map

In practical applications, hierarchical clustering dendrograms are often used in combination with a heat map, which allows us to represent the individual values in the sample matrix with a color code. In this section, we will discuss how to attach a dendrogram to a heat map plot and order the rows in the heat map correspondingly.

However, attaching a dendrogram to a heat map can be a little bit tricky, so let's go through this procedure step by step:

  1. We create a new figure object and define the x axis position, y axis position, width, and height of the dendrogram via the add_axes attribute. Furthermore, we rotate the dendrogram 90 degrees counter-clockwise. The code is as follows:
    >>> fig = plt.figure(figsize=(8,8), facecolor='white')
    >>> axd = fig.add_axes([0.09,0.1,0.2,0.6])
    >>> row_dendr = dendrogram(row_clusters, orientation='left')
    >>> # note: for matplotlib < v1.5.1, please use orientation='right'
  2. Next, we reorder the data in our initial DataFrame according to the clustering labels that can be accessed from the dendrogram object, which is essentially a Python dictionary, via the leaves key. The code is as follows:
    >>> df_rowclust = df.iloc[row_dendr['leaves'][::-1]]
  3. Now, we construct the heat map from the reordered DataFrame and position it next to the dendrogram:
    >>> axm = fig.add_axes([0.23,0.1,0.6,0.6])
    >>> cax = axm.matshow(df_rowclust, 
    ...              interpolation='nearest', cmap='hot_r')
  4. Finally, we will modify the aesthetics of the dendrogram by removing the axis ticks and hiding the axis spines. Also, we will add a color bar and assign the feature and sample names to the x and y axis tick labels, respectively:
    >>> axd.set_xticks([])
    >>> axd.set_yticks([])
    >>> for i in axd.spines.values():
    ...     i.set_visible(False)
    >>> fig.colorbar(cax)
    >>> axm.set_xticklabels([''] + list(df_rowclust.columns))
    >>> axm.set_yticklabels([''] + list(df_rowclust.index))
    >>> plt.show()

After following the previous steps, the heat map should be displayed with the dendrogram attached:

Attaching dendrograms to a heat map

As we can see, the order of rows in the heat map reflects the clustering of the samples in the dendrogram. In addition to a simple dendrogram, the color-coded values of each sample and feature in the heat map provide us with a nice summary of the dataset.

Applying agglomerative clustering via scikit-learn

In the previous subsection, we saw how to perform agglomerative hierarchical clustering using SciPy. However, there is also an AgglomerativeClustering implementation in scikit-learn, which allows us to choose the number of clusters that we want to return. This is useful if we want to prune the hierarchical cluster tree. By setting the n_cluster parameter to 3, we will now cluster the samples into three groups using the same complete linkage approach based on the Euclidean distance metric, as before:

>>> from sklearn.cluster import AgglomerativeClustering
>>> ac = AgglomerativeClustering(n_clusters=3,
...                              affinity='euclidean', 
...                              linkage='complete')
>>> labels = ac.fit_predict(X)
>>> print('Cluster labels: %s' % labels)
Cluster labels: [1 0 0 2 1]

Looking at the predicted cluster labels, we can see that the first and the fifth sample (ID_0 and ID_4) were assigned to one cluster (label 1), and the samples ID_1 and ID_2 were assigned to a second cluster (label 0). The sample ID_3 was put into its own cluster (label 2). Overall, the results are consistent with the results that we observed in the dendrogram. We shall note though that ID_3 is more similar to ID_4 and ID_0 than to ID_1 and ID_2, as shown in the preceding dendrogram figure; this is not clear from scikit-learn's clustering results. Let's now rerun the AgglomerativeClustering using n_cluster=2 in the following code snippet:

>>> ac = AgglomerativeClustering(n_clusters=2,
...                             affinity='euclidean',
...                             linkage='complete')
>>> labels = ac.fit_predict(X)
>>> print('Cluster labels: %s' % labels)
Cluster labels: [0 1 1 0 0]

As we can see, in this pruned clustering hierarchy, label ID_3 was assigned to the same cluster as ID_0 and ID_4, as expected.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset