In this section, we will take a look at an alternative approach to prototype-based clustering: hierarchical clustering. One advantage of hierarchical clustering algorithms is that it allows us to plot dendrograms (visualizations of a binary hierarchical clustering), which can help with the interpretation of the results by creating meaningful taxonomies. Another useful advantage of this hierarchical approach is that we do not need to specify the number of clusters up front.
The two main approaches to hierarchical clustering are agglomerative and divisive hierarchical clustering. In divisive hierarchical clustering, we start with one cluster that encompasses all our samples, and we iteratively split the cluster into smaller clusters until each cluster only contains one sample. In this section, we will focus on agglomerative clustering, which takes the opposite approach. We start with each sample as an individual cluster and merge the closest pairs of clusters until only one cluster remains.
The two standard algorithms for agglomerative hierarchical clustering are single linkage and complete linkage. Using single linkage, we compute the distances between the most similar members for each pair of clusters and merge the two clusters for which the distance between the most similar members is the smallest. The complete linkage approach is similar to single linkage but, instead of comparing the most similar members in each pair of clusters, we compare the most dissimilar members to perform the merge. This is shown in the following diagram:
Other commonly used algorithms for agglomerative hierarchical clustering include average linkage and Ward's linkage. In average linkage, we merge the cluster pairs based on the minimum average distances between all group members in the two clusters. In Ward's linkage, the two clusters that lead to the minimum increase of the total within-cluster SSE are merged.
In this section, we will focus on agglomerative clustering using the complete linkage approach. Hierarchical complete linkage clustering is an iterative procedure that can be summarized by the following steps:
Next, we will discuss how to compute the distance matrix (step 1). But first, let's generate some random sample data to work with: the rows represent different observations (IDs 0-4), and the columns are the different features (X
, Y
, Z
) of those samples:
>>> import pandas as pd >>> import numpy as np >>> np.random.seed(123) >>> variables = ['X', 'Y', 'Z'] >>> labels = ['ID_0','ID_1','ID_2','ID_3','ID_4'] >>> X = np.random.random_sample([5,3])*10 >>> df = pd.DataFrame(X, columns=variables, index=labels) >>> df
After executing the preceding code, we should now see the following data frame containing the randomly generated samples:
To calculate the distance matrix as input for the hierarchical clustering algorithm, we will use the pdist
function from SciPy's spatial.distance
submodule:
>>> from scipy.spatial.distance import pdist, squareform >>> row_dist = pd.DataFrame(squareform( ... pdist(df, metric='euclidean')), ... columns=labels, index=labels) >>> row_dist
Using the preceding code, we calculated the Euclidean distance between each pair of sample points in our dataset based on the features X
, Y
, and Z
. We provided the condensed distance matrix—returned by pdist
—as input to the squareform
function to create a symmetrical matrix of the pair-wise distances as shown here:
Next, we will apply the complete linkage agglomeration to our clusters using the linkage
function from SciPy's cluster.hierarchy
submodule, which returns a so-called linkage matrix
.
However, before we call the linkage
function, let us take a careful look at the function documentation:
>>> from scipy.cluster.hierarchy import linkage >>> help(linkage) […] Parameters: y : ndarray A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array. method : str, optional The linkage algorithm to use. See the Linkage Methods section below for full descriptions. metric : str, optional The distance metric to use. See the distance.pdist function for a list of valid distance metrics. Returns: Z : ndarray The hierarchical clustering encoded as a linkage matrix. […]
Based on the function description, we conclude that we can use a condensed distance matrix (upper triangular) from the pdist
function as an input attribute. Alternatively, we could also provide the initial data array and use the 'euclidean'
metric as a function argument in linkage
. However, we should not use the squareform
distance matrix that we defined earlier, since it would yield different distance values than expected. To sum it up, the three possible scenarios are listed here:
squareform
distance matrix shown in the following code snippet would lead to incorrect results:>>> from scipy.cluster.hierarchy import linkage >>> row_clusters = linkage(row_dist, ... method='complete', ... metric='euclidean')
>>> row_clusters = linkage(pdist(df, metric='euclidean'), ... method='complete')
>>> row_clusters = linkage(df.values, ... method='complete', ... metric='euclidean')
To take a closer look at the clustering results, we can turn clustering results into a pandas DataFrame
(best viewed in a Jupyter Notebook) as follows:
>>> pd.DataFrame(row_clusters, ... columns=['row label 1', ... 'row label 2', ... 'distance', ... 'no. of items in clust.'], ... index=['cluster %d' %(i+1) for i in ... range(row_clusters.shape[0])])
As shown in the following screenshot, the linkage matrix consists of several rows where each row represents one merge. The first and second columns denote the most dissimilar members in each cluster, and the third column reports the distance between those members. The last column returns the count of the members in each cluster:
Now that we have computed the linkage matrix, we can visualize the results in the form of a dendrogram:
>>> from scipy.cluster.hierarchy import dendrogram # make dendrogram black (part 1/2) # from scipy.cluster.hierarchy import set_link_color_palette # set_link_color_palette(['black']) >>> row_dendr = dendrogram(row_clusters, ... labels=labels, ... # make dendrogram black (part 2/2) ... # color_threshold=np.inf ... ) >>> plt.tight_layout() >>> plt.ylabel('Euclidean distance') >>> plt.show()
If you are executing the preceding code or reading an ebook version of this book, you will notice that the branches in the resulting dendrogram are shown in different colors. The coloring scheme is derived from a list of Matplotlib colors that are cycled for the distance thresholds in the dendrogram. For example, to display the dendrograms in black, you can uncomment the respective sections that I inserted in the preceding code:
Such a dendrogram summarizes the different clusters that were formed during the agglomerative hierarchical clustering; for example, we can see that the samples ID_0
and ID_4
, followed by ID_1
and ID_2
, are the most similar ones based on the Euclidean distance metric.
In practical applications, hierarchical clustering dendrograms are often used in combination with a heat map, which allows us to represent the individual values in the sample matrix with a color code. In this section, we will discuss how to attach a dendrogram to a heat map plot and order the rows in the heat map correspondingly.
However, attaching a dendrogram to a heat map can be a little bit tricky, so let's go through this procedure step by step:
figure
object and define the x axis position, y axis position, width, and height of the dendrogram via the add_axes
attribute. Furthermore, we rotate the dendrogram 90 degrees counter-clockwise. The code is as follows:>>> fig = plt.figure(figsize=(8,8), facecolor='white') >>> axd = fig.add_axes([0.09,0.1,0.2,0.6]) >>> row_dendr = dendrogram(row_clusters, orientation='left') >>> # note: for matplotlib < v1.5.1, please use orientation='right'
DataFrame
according to the clustering labels that can be accessed from the dendrogram object, which is essentially a Python dictionary, via the leaves
key. The code is as follows:>>> df_rowclust = df.iloc[row_dendr['leaves'][::-1]]
DataFrame
and position it next to the dendrogram:>>> axm = fig.add_axes([0.23,0.1,0.6,0.6]) >>> cax = axm.matshow(df_rowclust, ... interpolation='nearest', cmap='hot_r')
>>> axd.set_xticks([]) >>> axd.set_yticks([]) >>> for i in axd.spines.values(): ... i.set_visible(False) >>> fig.colorbar(cax) >>> axm.set_xticklabels([''] + list(df_rowclust.columns)) >>> axm.set_yticklabels([''] + list(df_rowclust.index)) >>> plt.show()
After following the previous steps, the heat map should be displayed with the dendrogram attached:
As we can see, the order of rows in the heat map reflects the clustering of the samples in the dendrogram. In addition to a simple dendrogram, the color-coded values of each sample and feature in the heat map provide us with a nice summary of the dataset.
In the previous subsection, we saw how to perform agglomerative hierarchical clustering using SciPy. However, there is also an AgglomerativeClustering
implementation in scikit-learn, which allows us to choose the number of clusters that we want to return. This is useful if we want to prune the hierarchical cluster tree. By setting the n_cluster
parameter to 3
, we will now cluster the samples into three groups using the same complete linkage approach based on the Euclidean distance metric, as before:
>>> from sklearn.cluster import AgglomerativeClustering >>> ac = AgglomerativeClustering(n_clusters=3, ... affinity='euclidean', ... linkage='complete') >>> labels = ac.fit_predict(X) >>> print('Cluster labels: %s' % labels) Cluster labels: [1 0 0 2 1]
Looking at the predicted cluster labels, we can see that the first and the fifth sample (ID_0
and ID_4
) were assigned to one cluster (label 1
), and the samples ID_1
and ID_2
were assigned to a second cluster (label 0
). The sample ID_3
was put into its own cluster (label 2
). Overall, the results are consistent with the results that we observed in the dendrogram. We shall note though that ID_3
is more similar to ID_4
and ID_0
than to ID_1
and ID_2
, as shown in the preceding dendrogram figure; this is not clear from scikit-learn's clustering results. Let's now rerun the AgglomerativeClustering
using n_cluster=2
in the following code snippet:
>>> ac = AgglomerativeClustering(n_clusters=2, ... affinity='euclidean', ... linkage='complete') >>> labels = ac.fit_predict(X) >>> print('Cluster labels: %s' % labels) Cluster labels: [0 1 1 0 0]
As we can see, in this pruned clustering hierarchy, label ID_3
was assigned to the same cluster as ID_0
and ID_4
, as expected.