In the previous chapters we discussed supervised learning tasks; we examined algorithms for regression and classification that learned from labeled training data. In this chapter we will discuss an unsupervised learning task called clustering. Clustering is used to find groups of similar observations within a set of unlabeled data. We will discuss the K-Means clustering algorithm, apply it to an image compression problem, and learn to measure its performance. Finally, we will work through a semi-supervised learning problem that combines clustering with classification.
Recall from Chapter 1, The Fundamentals of Machine Learning, that the goal of unsupervised learning is to discover hidden structure or patterns in unlabeled training data. Clustering, or cluster analysis, is the task of grouping observations such that members of the same group, or cluster, are more similar to each other by a given metric than they are to the members of the other clusters. As with supervised learning, we will represent an observation as an n-dimensional vector. For example, assume that your training data consists of the samples plotted in the following figure:
Clustering might reveal the following two groups, indicated by squares and circles:
Clustering could also reveal the following four groups:
Clustering is commonly used to explore a dataset. Social networks can be clustered to identify communities and to suggest missing connections between people. In biology, clustering is used to find groups of genes with similar expression patterns. Recommendation systems sometimes employ clustering to identify products or media that might appeal to a user. In marketing, clustering is used to find segments of similar consumers. In the following sections, we will work through an example of using the K-Means algorithm to cluster a dataset.
The K-Means algorithm is a clustering method that is popular because of its speed and scalability. K-Means is an iterative process of moving the centers of the clusters, or the centroids, to the mean position of their constituent points, and re-assigning instances to their closest clusters. The titular K is a hyperparameter that specifies the number of clusters that should be created; K-Means automatically assigns observations to clusters but cannot determine the appropriate number of clusters. K must be a positive integer that is less than the number of instances in the training set. Sometimes, the number of clusters is specified by the clustering problem's context. For example, a company that manufactures shoes might know that it is able to support manufacturing three new models. To understand what groups of customers to target with each model, it surveys customers and creates three clusters from the results. That is, the value of K was specified by the problem's context. Other problems may not require a specific number of clusters, and the optimal number of clusters may be ambiguous. We will discuss a heuristic to estimate the optimal number of clusters called the elbow method later in this chapter.
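The shoe-company scenario can be sketched with scikit-learn's KMeans estimator. The survey data below is randomly generated for illustration; only the choice of K = 3 comes from the problem's context:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical survey results: one row per customer, two features each
rng = np.random.RandomState(0)
X = rng.rand(30, 2)

# The problem's context fixes K at three (one cluster per shoe model)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
assignments = kmeans.fit_predict(X)
print(assignments)  # a cluster index (0, 1, or 2) for each customer
```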
The parameters of K-Means are the positions of the clusters' centroids and the observations that are assigned to each cluster. Like generalized linear models and decision trees, the optimal values of K-Means' parameters are found by minimizing a cost function. The cost function for K-Means is given by the following equation:

J = Σ_{k=1}^{K} Σ_{i ∈ C_k} ||x_i − μ_k||²

In the preceding equation, μ_k is the centroid for the cluster C_k. The cost function sums the distortions of the clusters. Each cluster's distortion is equal to the sum of the squared distances between its centroid and its constituent instances. The distortion is small for compact clusters and large for clusters that contain scattered instances. The parameters that minimize the cost function are learned through an iterative process of assigning observations to clusters and then moving the clusters. First, the clusters' centroids are initialized to random positions. In practice, setting the centroids' positions equal to the positions of randomly selected observations yields the best results. During each iteration, K-Means assigns observations to the cluster that they are closest to, and then moves the centroids to their assigned observations' mean location. Let's work through an example by hand using the training data shown in the following table:
Instance | X0 | X1
---|---|---
1 | 7 | 5
2 | 5 | 7
3 | 7 | 7
4 | 3 | 3
5 | 4 | 6
6 | 1 | 4
7 | 0 | 0
8 | 2 | 2
9 | 8 | 7
10 | 6 | 8
11 | 5 | 5
12 | 3 | 7
There are two explanatory variables, so each instance is represented as a two-dimensional vector. The instances are plotted in the following figure:
Assume that K-Means initializes the centroid for the first cluster to the fifth instance and the centroid for the second cluster to the eleventh instance. For each instance, we will calculate its distance to both centroids, and assign it to the cluster with the closest centroid. The initial assignments are shown in the Cluster column of the following table:
Instance | X0 | X1 | C1 distance | C2 distance | Last cluster | Cluster | Changed?
---|---|---|---|---|---|---|---
1 | 7 | 5 | 3.16228 | 2 | None | C2 | Yes
2 | 5 | 7 | 1.41421 | 2 | None | C1 | Yes
3 | 7 | 7 | 3.16228 | 2.82843 | None | C2 | Yes
4 | 3 | 3 | 3.16228 | 2.82843 | None | C2 | Yes
5 | 4 | 6 | 0 | 1.41421 | None | C1 | Yes
6 | 1 | 4 | 3.60555 | 4.12311 | None | C1 | Yes
7 | 0 | 0 | 7.21110 | 7.07107 | None | C2 | Yes
8 | 2 | 2 | 4.47214 | 4.24264 | None | C2 | Yes
9 | 8 | 7 | 4.12311 | 3.60555 | None | C2 | Yes
10 | 6 | 8 | 2.82843 | 3.16228 | None | C1 | Yes
11 | 5 | 5 | 1.41421 | 0 | None | C2 | Yes
12 | 3 | 7 | 1.41421 | 2.82843 | None | C1 | Yes
C1 centroid | 4 | 6 | | | | |
C2 centroid | 5 | 5 | | | | |
The plotted centroids and the initial cluster assignments are shown in the following graph. Instances assigned to the first cluster are marked with Xs, and instances assigned to the second cluster are marked with dots. The markers for the centroids are larger than the markers for the instances.
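This assignment step can be reproduced in a few lines of NumPy. The following sketch computes the same distance columns as the preceding table and assigns each instance to its closest centroid (0 denotes C1, 1 denotes C2):

```python
import numpy as np

# The twelve training instances from the table above
X = np.array([[7, 5], [5, 7], [7, 7], [3, 3], [4, 6], [1, 4],
              [0, 0], [2, 2], [8, 7], [6, 8], [5, 5], [3, 7]])
# Initial centroids: the positions of the fifth and eleventh instances
centroids = np.array([[4, 6], [5, 5]])

# Euclidean distance from every instance to each centroid via broadcasting
distances = np.linalg.norm(X[:, np.newaxis, :] - centroids, axis=2)
assignments = distances.argmin(axis=1)  # 0 = C1, 1 = C2
print(np.round(distances, 5))
print(assignments)
```

The printed assignments match the Cluster column of the table: instances 2, 5, 6, 10, and 12 belong to C1, and the rest to C2.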
Now we will move both centroids to the means of their constituent instances, recalculate the distances of the training instances to the centroids, and reassign the instances to the closest centroids:
Instance | X0 | X1 | C1 distance | C2 distance | Last cluster | New cluster | Changed?
---|---|---|---|---|---|---|---
1 | 7 | 5 | 3.492850 | 2.575394 | C2 | C2 | No
2 | 5 | 7 | 1.341641 | 2.889107 | C1 | C1 | No
3 | 7 | 7 | 3.255764 | 3.749830 | C2 | C1 | Yes
4 | 3 | 3 | 3.492850 | 1.943067 | C2 | C2 | No
5 | 4 | 6 | 0.447214 | 1.943067 | C1 | C1 | No
6 | 1 | 4 | 3.687818 | 3.574285 | C1 | C2 | Yes
7 | 0 | 0 | 7.443118 | 6.169378 | C2 | C2 | No
8 | 2 | 2 | 4.753946 | 3.347250 | C2 | C2 | No
9 | 8 | 7 | 4.242641 | 4.463000 | C2 | C1 | Yes
10 | 6 | 8 | 2.720294 | 4.113194 | C1 | C1 | No
11 | 5 | 5 | 1.843909 | 0.958315 | C2 | C2 | No
12 | 3 | 7 | 1.000000 | 3.260775 | C1 | C1 | No
C1 centroid | 3.8 | 6.4 | | | | |
C2 centroid | 4.571429 | 4.142857 | | | | |
The new clusters are plotted in the following graph. Note that the centroids are diverging and several instances have changed their assignments:
Now, we will move the centroids to the means of their constituents' locations again and reassign the instances to their nearest centroids. The centroids continue to diverge, as shown in the following figure:
None of the instances' centroid assignments will change in the next iteration; K-Means will continue iterating until some stopping criterion is satisfied. Usually, this criterion is either a threshold for the difference between the values of the cost function for subsequent iterations, or a threshold for the change in the positions of the centroids between subsequent iterations. If these stopping criteria are small enough, K-Means will converge on an optimum. This optimum will not necessarily be the global optimum.
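In scikit-learn, these stopping criteria are exposed as KMeans' max_iter and tol parameters. A brief sketch using the toy data from this section:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[7, 5], [5, 7], [7, 7], [3, 3], [4, 6], [1, 4],
              [0, 0], [2, 2], [8, 7], [6, 8], [5, 5], [3, 7]])

# max_iter caps the number of iterations; tol is a tolerance on how much
# the centroids may move between iterations before convergence is declared
kmeans = KMeans(n_clusters=2, max_iter=300, tol=1e-4, n_init=10,
                random_state=0)
kmeans.fit(X)
print(kmeans.n_iter_)   # iterations performed before a criterion was met
print(kmeans.inertia_)  # final value of the cost function (sum of distortions)
```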
Recall that K-Means initially sets the positions of the clusters' centroids to the positions of randomly selected observations. Sometimes, the random initialization is unlucky and the centroids are set to positions that cause K-Means to converge to a local optimum. For example, assume that K-Means randomly initializes two cluster centroids to the following positions:
K-Means will eventually converge on a local optimum like that shown in the following figure. These clusters may be informative, but it is more likely that the top and bottom groups of observations are more informative clusters. To avoid local optima, K-Means is often repeated dozens or even hundreds of times. In each repetition, it is randomly initialized to different starting cluster positions. The initialization that yields the lowest value of the cost function is selected.
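scikit-learn automates these restarts through KMeans' n_init parameter, and by default uses the k-means++ initialization scheme rather than purely random starting positions, which further reduces the chance of a bad start. A small sketch on hand-made blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of synthetic observations
rng = np.random.RandomState(0)
X = np.vstack((rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))))

# n_init controls how many random restarts are performed; the run with
# the lowest cost function value (inertia) is kept
single = KMeans(n_clusters=2, n_init=1, random_state=0).fit(X)
restarted = KMeans(n_clusters=2, n_init=25, random_state=0).fit(X)
print(single.inertia_, restarted.inertia_)
```

With more restarts, the retained solution's cost can only match or improve on a single run's.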
If K is not specified by the problem's context, the optimal number of clusters can be estimated using a technique called the elbow method. The elbow method plots the value of the cost function produced by different values of K. As K increases, the average distortion will decrease; each cluster will have fewer constituent instances, and the instances will be closer to their respective centroids. However, the improvements to the average distortion will decline as K increases. The value of K at which the improvement to the distortion declines the most is called the elbow. Let's use the elbow method to choose the number of clusters for a dataset. The following scatter plot visualizes a dataset with two obvious clusters:
We will calculate and plot the mean distortion of the clusters for each value of K from 1 to 10 with the following code:
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from scipy.spatial.distance import cdist
>>> import matplotlib.pyplot as plt
>>> cluster1 = np.random.uniform(0.5, 1.5, (2, 10))
>>> cluster2 = np.random.uniform(3.5, 4.5, (2, 10))
>>> X = np.hstack((cluster1, cluster2)).T
>>> K = range(1, 11)
>>> meandistortions = []
>>> for k in K:
...     kmeans = KMeans(n_clusters=k)
...     kmeans.fit(X)
...     meandistortions.append(sum(np.min(cdist(X,
...         kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
>>> plt.plot(K, meandistortions, 'bx-')
>>> plt.xlabel('k')
>>> plt.ylabel('Average distortion')
>>> plt.title('Selecting k with the Elbow Method')
>>> plt.show()
The average distortion improves rapidly as we increase K from 1 to 2. There is little improvement for values of K greater than 2. Now let's use the elbow method on the following dataset with three clusters:
The following figure shows the elbow plot for the dataset. From this, we can see that the rate of improvement to the average distortion declines the most when adding a fourth cluster; that is, the elbow method confirms that K should be set to three for this dataset.