In the previous chapters, we used supervised learning techniques to build machine learning models using data where the answer was already known—the class labels were already available in our training data. In this chapter, we will switch gears and explore cluster analysis, a category of unsupervised learning techniques that allows us to discover hidden structures in data where we do not know the right answer upfront. The goal of clustering is to find a natural grouping in data such that items in the same cluster are more similar to each other than those from different clusters.
Given its exploratory nature, clustering is an exciting topic and, in this chapter, you will learn about the following concepts that can help you to organize data into meaningful structures:
In this section, we will discuss one of the most popular clustering algorithms, k-means, which is widely used in academia as well as in industry. Clustering (or cluster analysis) is a technique that allows us to find groups of similar objects, objects that are more related to each other than to objects in other groups. Examples of business-oriented applications of clustering include the grouping of documents, music, and movies by different topics, or finding customers that share similar interests based on common purchase behaviors as a basis for recommendation engines.
As we will see in a moment, the k-means algorithm is extremely easy to implement but is also computationally very efficient compared to other clustering algorithms, which might explain its popularity. The k-means algorithm belongs to the category of prototype-based clustering. We will discuss two other categories of clustering, hierarchical and density-based clustering, later in this chapter. Prototype-based clustering means that each cluster is represented by a prototype, which can either be the centroid (average) of similar points with continuous features, or the medoid (the most representative or most frequently occurring point) in the case of categorical features. While k-means is very good at identifying clusters of spherical shape, one of the drawbacks of this clustering algorithm is that we have to specify the number of clusters k a priori. An inappropriate choice for k can result in poor clustering performance. Later in this chapter, we will discuss the elbow method and silhouette plots, which are useful techniques to evaluate the quality of a clustering to help us determine the optimal number of clusters k.
Although k-means clustering can be applied to data in higher dimensions, we will walk through the following examples using a simple two-dimensional dataset for the purpose of visualization:
>>> from sklearn.datasets import make_blobs >>> X, y = make_blobs(n_samples=150, ... n_features=2, ... centers=3, ... cluster_std=0.5, ... shuffle=True, ... random_state=0) >>> import matplotlib.pyplot as plt >>> plt.scatter(X[:,0], ... X[:,1], ... c='white', ... marker='o', ... s=50) >>> plt.grid() >>> plt.show()
The dataset that we just created consists of 150 randomly generated points that are roughly grouped into three regions with higher density, which is visualized via a two-dimensional scatterplot:
In real-world applications of clustering, we do not have any ground truth category information about those samples; otherwise, it would fall into the category of supervised learning. Thus, our goal is to group the samples based on their feature similarities, which we can be achieved using the k-means algorithm that can be summarized by the following four steps:
Now the next question is how do we measure similarity between objects? We can define similarity as the opposite of distance, and a commonly used distance for clustering samples with continuous features is the squared Euclidean distance between two points x and y in m-dimensional space:
Note that, in the preceding equation, the index j refers to the jth dimension (feature column) of the sample points x and y. In the rest of this section, we will use the superscripts i and j to refer to the sample index and cluster index, respectively.
Based on this Euclidean distance metric, we can describe the k-means algorithm as a simple optimization problem, an iterative approach for minimizing the within-cluster sum of squared errors (SSE), which is sometimes also called cluster inertia:
Here, is the representative point (centroid) for cluster j, and if the sample is in cluster j; otherwise.
Now that you have learned how the simple k-means algorithm works, let's apply it to our sample dataset using the KMeans
class from scikit-learn's cluster
module:
>>> from sklearn.cluster import KMeans >>> km = KMeans(n_clusters=3, ... init='random', ... n_init=10, ... max_iter=300, ... tol=1e-04, ... random_state=0) >>> y_km = km.fit_predict(X)
Using the preceding code, we set the number of desired clusters to 3
; specifying the number of clusters a priori is one of the limitations of k-means. We set n_init=10
to run the k-means clustering algorithms 10 times independently with different random centroids to choose the final model as the one with the lowest SSE. Via the max_iter
parameter, we specify the maximum number of iterations for each single run (here, 300
). Note that the k-means implementation in scikit-learn stops early if it converges before the maximum number of iterations is reached.
However, it is possible that k-means does not reach convergence for a particular run, which can be problematic (computationally expensive) if we choose relatively large values for max_iter
. One way to deal with convergence problems is to choose larger values for tol
, which is a parameter that controls the tolerance with regard to the changes in the within-cluster sum-squared-error to declare convergence. In the preceding code, we chose a tolerance of 1e-04
(=0.0001).
So far, we discussed the classic k-means algorithm that uses a random seed to place the initial centroids, which can sometimes result in bad clusterings or slow convergence if the initial centroids are chosen poorly. One way to address this issue is to run the k-means algorithm multiple times on a dataset and choose the best performing model in terms of the SSE. Another strategy is to place the initial centroids far away from each other via the k-means++ algorithm, which leads to better and more consistent results than the classic k-means (D. Arthur and S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007).
The initialization in k-means++ can be summarized as follows:
Another problem with k-means is that one or more clusters can be empty. Note that this problem does not exist for k-medoids or fuzzy C-means, an algorithm that we will discuss in the next subsection. However, this problem is accounted for in the current k-means implementation in scikit-learn. If a cluster is empty, the algorithm will search for the sample that is farthest away from the centroid of the empty cluster. Then it will reassign the centroid to be this farthest point.
After we predicted the cluster labels y_km
and discussed the challenges of the k-means algorithm, let's now visualize the clusters that k-means identified in the dataset together with the cluster centroids. These are stored under the centers_
attribute of the fitted KMeans
object:
>>> plt.scatter(X[y_km==0,0], ... X[y_km ==0,1], ... s=50, ... c='lightgreen', ... marker='s', ... label='cluster 1') >>> plt.scatter(X[y_km ==1,0], ... X[y_km ==1,1], ... s=50, ... c='orange', ... marker='o', ... label='cluster 2') >>> plt.scatter(X[y_km ==2,0], ... X[y_km ==2,1], ... s=50, ... c='lightblue', ... marker='v', ... label='cluster 3') >>> plt.scatter(km.cluster_centers_[:,0], ... km.cluster_centers_[:,1], ... s=250, ... marker='*', ... c='red', ... label='centroids') >>> plt.legend() >>> plt.grid() >>> plt.show()
In the following scatterplot, we can see that k-means placed the three centroids at the center of each sphere, which looks like a reasonable grouping given this dataset:
Although k-means worked well on this toy dataset, we need to note some of the main challenges of k-means. One of the drawbacks of k-means is that we have to specify the number of clusters k a priori, which may not always be so obvious in real-world applications, especially if we are working with a higher dimensional dataset that cannot be visualized. The other properties of k-means are that clusters do not overlap and are not hierarchical, and we also assume that there is at least one item in each cluster.
Hard clustering describes a family of algorithms where each sample in a dataset is assigned to exactly one cluster, as in the k-means algorithm that we discussed in the previous subsection. In contrast, algorithms for soft clustering (sometimes also called fuzzy clustering) assign a sample to one or more clusters. A popular example of soft clustering is the fuzzy C-means (FCM) algorithm (also called soft k-means or fuzzy k-means). The original idea goes back to the 1970s where Joseph C. Dunn first proposed an early version of fuzzy clustering to improve k-means (J. C. Dunn. A Fuzzy Relative of the Isodata Process and its Use in Detecting Compact Well-separated Clusters. 1973). Almost a decade later, James C. Bedzek published his work on the improvements of the fuzzy clustering algorithm, which is now known as the FCM algorithm (J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Springer Science & Business Media, 2013).
The FCM procedure is very similar to k-means. However, we replace the hard cluster assignment by probabilities for each point belonging to each cluster. In k-means, we could express the cluster membership of a sample x by a sparse vector of binary values:
Here, the index position with value 1 indicates the cluster centroid the sample is assigned to (assuming ). In contrast, a membership vector in FCM could be represented as follows:
Here, each value falls in the range [0, 1] and represents a probability of membership to the respective cluster centroid. The sum of the memberships for a given sample is equal to 1. Similarly to the k-means algorithm, we can summarize the FCM algorithm in four key steps:
The objective function of FCM—we abbreviate it by —looks very similar to the within cluster sum-squared-error that we minimize in k-means:
However, note that the membership indicator is not a binary value as in k-means ) but a real value that denotes the cluster membership probability ). You also may have noticed that we added an additional exponent to ; the exponent m, any number greater or equal to 1 (typically m = 2), is the so-called fuzziness coefficient (or simply fuzzifier) that controls the degree of fuzziness. The larger the value of , the smaller the cluster membership becomes, which leads to fuzzier clusters. The cluster membership probability itself is calculated as follows:
For example, if we chose three cluster centers as in the previous k-means example, we could calculate the membership of the sample belonging to the cluster as:
The center of a cluster itself is calculated as the mean of all samples in the cluster weighted by the membership degree of belonging to its own cluster:
Just by looking at the equation to calculate the cluster memberships, it is intuitive to say that each iteration in FCM is more expensive than an iteration in k-means. However, FCM typically requires fewer iterations overall to reach convergence. Unfortunately, the FCM algorithm is currently not implemented in scikit-learn. However, it has been found in practice that both k-means and FCM produce very similar clustering outputs, as described in a study by Soumi Ghosh and Sanjay K. Dubey (S. Ghosh and S. K. Dubey. Comparative Analysis of k-means and Fuzzy c-means Algorithms. IJACSA, 4:35–38, 2013).
One of the main challenges in unsupervised learning is that we do not know the definitive answer. We don't have the ground truth class labels in our dataset that allow us to apply the techniques that we used in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning, in order to evaluate the performance of a supervised model. Thus, in order to quantify the quality of clustering, we need to use intrinsic metrics—such as the within-cluster SSE (distortion) that we discussed earlier in this chapter—to compare the performance of different k-means clusterings. Conveniently, we don't need to compute the within-cluster SSE explicitly as it is already accessible via the inertia_
attribute after fitting a KMeans
model:
>>> print('Distortion: %.2f' % km.inertia_) Distortion: 72.48
Based on the within-cluster SSE, we can use a graphical tool, the so-called elbow method, to estimate the optimal number of clusters k for a given task. Intuitively, we can say that, if k increases, the distortion will decrease. This is because the samples will be closer to the centroids they are assigned to. The idea behind the elbow method is to identify the value of k where the distortion begins to increase most rapidly, which will become more clear if we plot distortion for different values of k:
>>> distortions = [] >>> for i in range(1, 11): ... km = KMeans(n_clusters=i, ... init='k-means++', ... n_init=10, ... max_iter=300, ... random_state=0) >>> km.fit(X) >>> distortions.append(km.inertia_) >>> plt.plot(range(1,11), distortions, marker='o') >>> plt.xlabel('Number of clusters') >>> plt.ylabel('Distortion') >>> plt.show()
As we can see in the following plot, the elbow is located at k = 3, which provides evidence that k = 3 is indeed a good choice for this dataset:
Another intrinsic metric to evaluate the quality of a clustering is silhouette analysis, which can also be applied to clustering algorithms other than k-means that we will discuss later in this chapter. Silhouette analysis can be used as a graphical tool to plot a measure of how tightly grouped the samples in the clusters are. To calculate the silhouette coefficient of a single sample in our dataset, we can apply the following three steps:
The silhouette coefficient is bounded in the range -1 to 1. Based on the preceding formula, we can see that the silhouette coefficient is 0 if the cluster separation and cohesion are equal (). Furthermore, we get close to an ideal silhouette coefficient of 1 if , since quantifies how dissimilar a sample is to other clusters, and tells us how similar it is to the other samples in its own cluster, respectively.
The silhouette coefficient is available as silhouette_samples
from scikit-learn's metric
module, and optionally the silhouette_scores
can be imported. This calculates the average silhouette coefficient across all samples, which is equivalent to numpy.mean(silhouette_samples(…))
. By executing the following code, we will now create a plot of the silhouette coefficients for a k-means clustering with :
>>> km = KMeans(n_clusters=3, ... init='k-means++', ... n_init=10, ... max_iter=300, ... tol=1e-04, ... random_state=0) >>> y_km = km.fit_predict(X) >>> import numpy as np >>> from matplotlib import cm >>> from sklearn.metrics import silhouette_samples >>> cluster_labels = np.unique(y_km) >>> n_clusters = cluster_labels.shape[0] >>> silhouette_vals = silhouette_samples(X, ... y_km, ... metric='euclidean') >>> y_ax_lower, y_ax_upper = 0, 0 >>> yticks = [] >>> for i, c in enumerate(cluster_labels): ... c_silhouette_vals = silhouette_vals[y_km == c] ... c_silhouette_vals.sort() ... y_ax_upper += len(c_silhouette_vals) ... color = cm.jet(i / n_clusters) ... plt.barh(range(y_ax_lower, y_ax_upper), ... c_silhouette_vals, ... height=1.0, ... edgecolor='none', ... color=color) ... yticks.append((y_ax_lower + y_ax_upper) / 2) ... y_ax_lower += len(c_silhouette_vals) >>> silhouette_avg = np.mean(silhouette_vals) >>> plt.axvline(silhouette_avg, ... color="red", ... linestyle="--") >>> plt.yticks(yticks, cluster_labels + 1) >>> plt.ylabel('Cluster') >>> plt.xlabel('Silhouette coefficient') >>> plt.show()
Through a visual inspection of the silhouette plot, we can quickly scrutinize the sizes of the different clusters and identify clusters that contain outliers:
As we can see in the preceding silhouette plot, our silhouette coefficients are not even close to 0, which can be an indicator of a good clustering. Furthermore, to summarize the goodness of our clustering, we added the average silhouette coefficient to the plot (dotted line).
To see how a silhouette plot looks for a relatively bad clustering, let's seed the k-means algorithm with two centroids only:
>>> km = KMeans(n_clusters=2, ... init='k-means++', ... n_init=10, ... max_iter=300, ... tol=1e-04, ... random_state=0) >>> y_km = km.fit_predict(X) >>> plt.scatter(X[y_km==0,0], ... X[y_km==0,1], ... s=50, c='lightgreen', ... marker='s', ... label='cluster 1') >>> plt.scatter(X[y_km==1,0], ... X[y_km==1,1], ... s=50, ... c='orange', ... marker='o', ... label='cluster 2') >>> plt.scatter(km.cluster_centers_[:,0], ... km.cluster_centers_[:,1], ... s=250, ... marker='*', ... c='red', ... label='centroids') >>> plt.legend() >>> plt.grid() >>> plt.show()
As we can see in the following scatterplot, one of the centroids falls between two of the three spherical groupings of the sample points. Although the clustering does not look completely terrible, it is suboptimal.
Next we create the silhouette plot to evaluate the results. Please keep in mind that we typically do not have the luxury of visualizing datasets in two-dimensional scatterplots in real-world problems, since we typically work with data in higher dimensions:
>>> cluster_labels = np.unique(y_km) >>> n_clusters = cluster_labels.shape[0] >>> silhouette_vals = silhouette_samples(X, ... y_km, ... metric='euclidean') >>> y_ax_lower, y_ax_upper = 0, 0 yticks = [] >>> for i, c in enumerate(cluster_labels): ... c_silhouette_vals = silhouette_vals[y_km == c] ... c_silhouette_vals.sort() ... y_ax_upper += len(c_silhouette_vals) ... color = cm.jet(i / n_clusters) ... plt.barh(range(y_ax_lower, y_ax_upper), ... c_silhouette_vals, ... height=1.0, ... edgecolor='none', ... color=color) ... yticks.append((y_ax_lower + y_ax_upper) / 2) ... y_ax_lower += len(c_silhouette_vals) >>> silhouette_avg = np.mean(silhouette_vals) >>> plt.axvline(silhouette_avg, color="red", linestyle="--") >>> plt.yticks(yticks, cluster_labels + 1) >>> plt.ylabel('Cluster') >>> plt.xlabel('Silhouette coefficient') >>> plt.show()
As we can see in the resulting plot, the silhouettes now have visibly different lengths and width, which yields further evidence for a suboptimal clustering: