Chapter 6. Clustering with K-Means

In the previous chapters we discussed supervised learning tasks; we examined algorithms for regression and classification that learned from labeled training data. In this chapter we will discuss an unsupervised learning task called clustering. Clustering is used to find groups of similar observations within a set of unlabeled data. We will discuss the K-Means clustering algorithm, apply it to an image compression problem, and learn to measure its performance. Finally, we will work through a semi-supervised learning problem that combines clustering with classification.

Recall from Chapter 1, The Fundamentals of Machine Learning, that the goal of unsupervised learning is to discover hidden structure or patterns in unlabeled training data. Clustering, or cluster analysis, is the task of grouping observations such that members of the same group, or cluster, are more similar to each other by a given metric than they are to the members of the other clusters. As with supervised learning, we will represent an observation as an n-dimensional vector. For example, assume that your training data consists of the samples plotted in the following figure:

[Figure: scatter plot of the unlabeled training samples]

Clustering might reveal the following two groups, indicated by squares and circles:

[Figure: the samples grouped into two clusters, indicated by squares and circles]

Clustering could also reveal the following four groups:

[Figure: the samples grouped into four clusters]

Clustering is commonly used to explore a dataset. Social networks can be clustered to identify communities and to suggest missing connections between people. In biology, clustering is used to find groups of genes with similar expression patterns. Recommendation systems sometimes employ clustering to identify products or media that might appeal to a user. In marketing, clustering is used to find segments of similar consumers. In the following sections, we will work through an example of using the K-Means algorithm to cluster a dataset.

Clustering with the K-Means algorithm

The K-Means algorithm is a clustering method that is popular because of its speed and scalability. K-Means is an iterative process of moving the centers of the clusters, or the centroids, to the mean position of their constituent points, and re-assigning instances to their closest clusters. The titular K is a hyperparameter that specifies the number of clusters that should be created; K-Means automatically assigns observations to clusters but cannot determine the appropriate number of clusters. K must be a positive integer that is less than the number of instances in the training set. Sometimes, the number of clusters is specified by the clustering problem's context. For example, a company that manufactures shoes might know that it is able to support manufacturing three new models. To understand what groups of customers to target with each model, it surveys customers and creates three clusters from the results. That is, the value of K was specified by the problem's context. Other problems may not require a specific number of clusters, and the optimal number of clusters may be ambiguous. We will discuss a heuristic for estimating the optimal number of clusters, called the elbow method, later in this chapter.
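
For example, here is a minimal sketch of how K is supplied to scikit-learn's KMeans class through the n_clusters hyperparameter; the survey data is invented purely for illustration:

>>> import numpy as np
>>> from sklearn.cluster import KMeans

>>> # Hypothetical customer survey responses, with two features per customer.
>>> surveys = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.1],
...                     [5.0, 5.5], [5.2, 5.1], [4.8, 5.4],
...                     [9.0, 1.0], [9.2, 0.8], [8.8, 1.2]])

>>> # The shoe manufacturer can support three new models, so K is set to 3.
>>> kmeans = KMeans(n_clusters=3).fit(surveys)
>>> kmeans.labels_           # the cluster assigned to each surveyed customer
>>> kmeans.cluster_centers_  # the learned centroid positions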

The parameters of K-Means are the positions of the clusters' centroids and the observations that are assigned to each cluster. Like generalized linear models and decision trees, the optimal values of K-Means' parameters are found by minimizing a cost function. The cost function for K-Means is given by the following equation:

J = \sum_{k=1}^{K} \sum_{i \in C_k} \left\lVert x_i - \mu_k \right\rVert^2

In the preceding equation, μ_k is the centroid for cluster C_k. The cost function sums the distortions of the clusters. Each cluster's distortion is equal to the sum of the squared distances between its centroid and its constituent instances. The distortion is small for compact clusters and large for clusters that contain scattered instances. The parameters that minimize the cost function are learned through an iterative process of assigning observations to clusters and then moving the clusters. First, the clusters' centroids are initialized to random positions. In practice, setting the centroids' positions equal to the positions of randomly selected observations yields the best results. During each iteration, K-Means assigns observations to the cluster that they are closest to, and then moves the centroids to their assigned observations' mean location. Let's work through an example by hand using the training data shown in the following table:

Instance    X0    X1
1           7     5
2           5     7
3           7     7
4           3     3
5           4     6
6           1     4
7           0     0
8           2     2
9           8     7
10          6     8
11          5     5
12          3     7

There are two explanatory variables, so each instance is represented by two features. The instances are plotted in the following figure:

[Figure: scatter plot of the twelve training instances]
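
As a concrete reference for the cost function above, the following sketch represents the twelve instances as a NumPy array and computes the total distortion for a candidate pair of centroid positions; the centroid values used here are only illustrative:

>>> import numpy as np

>>> # The twelve training instances from the preceding table.
>>> X = np.array([[7, 5], [5, 7], [7, 7], [3, 3], [4, 6], [1, 4],
...               [0, 0], [2, 2], [8, 7], [6, 8], [5, 5], [3, 7]])

>>> def distortion(X, centroids):
...     """Sum of squared distances from each instance to its nearest centroid."""
...     sq_dists = ((X[:, np.newaxis, :] - centroids) ** 2).sum(axis=2)
...     return sq_dists.min(axis=1).sum()

>>> distortion(X, np.array([[4, 6], [5, 5]]))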

Assume that K-Means initializes the centroid for the first cluster to the fifth instance and the centroid for the second cluster to the eleventh instance. For each instance, we will calculate its distance to both centroids, and assign it to the cluster with the closest centroid. The initial assignments are shown in the Cluster column of the following table:

Instance    X0    X1    C1 distance    C2 distance    Last cluster    Cluster    Changed?
1           7     5     3.16228        2              None            C2         Yes
2           5     7     1.41421        2              None            C1         Yes
3           7     7     3.16228        2.82843        None            C2         Yes
4           3     3     3.16228        2.82843        None            C2         Yes
5           4     6     0              1.41421        None            C1         Yes
6           1     4     3.60555        4.12311        None            C1         Yes
7           0     0     7.21110        7.07107        None            C2         Yes
8           2     2     4.47214        4.24264        None            C2         Yes
9           8     7     4.12311        3.60555        None            C2         Yes
10          6     8     2.82843        3.16228        None            C1         Yes
11          5     5     1.41421        0              None            C2         Yes
12          3     7     1.41421        2.82843        None            C1         Yes

C1 centroid    4    6
C2 centroid    5    5

The plotted centroids and the initial cluster assignments are shown in the following graph. Instances assigned to the first cluster are marked with Xs, and instances assigned to the second cluster are marked with dots. The markers for the centroids are larger than the markers for the instances.

[Figure: plot of the initial centroids and cluster assignments]
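
The distance columns and the assignments in the preceding table can be verified with a couple of lines of SciPy; this is only a sketch for checking the hand calculations, and it redefines the same data and initial centroids:

>>> import numpy as np
>>> from scipy.spatial.distance import cdist

>>> X = np.array([[7, 5], [5, 7], [7, 7], [3, 3], [4, 6], [1, 4],
...               [0, 0], [2, 2], [8, 7], [6, 8], [5, 5], [3, 7]])
>>> initial_centroids = np.array([[4, 6], [5, 5]])  # instances 5 and 11

>>> # Each row holds an instance's C1 distance and C2 distance.
>>> distances = cdist(X, initial_centroids, 'euclidean')
>>> # Index of the nearest centroid: 0 corresponds to C1, 1 to C2.
>>> assignments = distances.argmin(axis=1)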

Now we will move both centroids to the means of their constituent instances, recalculate the distances of the training instances to the centroids, and reassign the instances to the closest centroids:

Instance    X0    X1    C1 distance    C2 distance    Last cluster    New cluster    Changed?
1           7     5     3.492850       2.575394       C2              C2             No
2           5     7     1.341641       2.889107       C1              C1             No
3           7     7     3.255764       3.749830       C2              C1             Yes
4           3     3     3.492850       1.943067       C2              C2             No
5           4     6     0.447214       1.943067       C1              C1             No
6           1     4     3.687818       3.574285       C1              C2             Yes
7           0     0     7.443118       6.169378       C2              C2             No
8           2     2     4.753946       3.347250       C2              C2             No
9           8     7     4.242641       4.463000       C2              C1             Yes
10          6     8     2.720294       4.113194       C1              C1             No
11          5     5     1.843909       0.958315       C2              C2             No
12          3     7     1.000000       3.260775       C1              C1             No

C1 centroid    3.8         6.4
C2 centroid    4.571429    4.142857

The new clusters are plotted in the following graph. Note that the centroids are diverging and several instances have changed their assignments:

[Figure: the clusters and centroids after the first update]
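
Moving a centroid to the mean position of its constituent instances is a one-line operation; the following sketch, which continues from the X and assignments arrays in the previous sketch, reproduces the centroid positions listed at the bottom of the preceding table:

>>> # Move each centroid to the mean position of the instances assigned to it.
>>> new_centroids = np.array([X[assignments == k].mean(axis=0)
...                           for k in range(2)])
>>> new_centroids  # approximately [[3.8, 6.4], [4.571429, 4.142857]]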

Now, we will move the centroids to the means of their constituents' locations again and reassign the instances to their nearest centroids. The centroids continue to diverge, as shown in the following figure:

[Figure: the clusters and centroids after the second update]

None of the instances' centroid assignments will change in the next iteration; K-Means will continue iterating until some stopping criterion is satisfied. Usually, this criterion is either a threshold for the difference between the values of the cost function for subsequent iterations, or a threshold for the change in the positions of the centroids between subsequent iterations. If these stopping criteria are small enough, K-Means will converge on an optimum. This optimum will not necessarily be the global optimum.
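
In scikit-learn, both stopping criteria are exposed as hyperparameters of the KMeans estimator; the values shown below are illustrative rather than recommendations:

>>> from sklearn.cluster import KMeans

>>> # max_iter caps the number of assignment/update iterations, and tol is the
>>> # threshold on the change in the centroids between consecutive iterations
>>> # below which the algorithm is considered to have converged.
>>> kmeans = KMeans(n_clusters=2, max_iter=300, tol=1e-4)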

Local optima

Recall that K-Means initially sets the positions of the clusters' centroids to the positions of randomly selected observations. Sometimes, the random initialization is unlucky and the centroids are set to positions that cause K-Means to converge to a local optimum. For example, assume that K-Means randomly initializes two cluster centroids to the following positions:

[Figure: an unlucky random initialization of the two centroids]

K-Means will eventually converge on a local optimum like that shown in the following figure. These clusters may be informative, but the top and bottom groups of observations would likely form more informative clusters. To avoid local optima, K-Means is often repeated dozens or even hundreds of times. In each repetition, it is randomly initialized to different starting centroid positions. The initialization that produces the lowest value of the cost function is selected.

[Figure: the local optimum reached from the unlucky initialization]
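
scikit-learn handles this repetition automatically: the n_init parameter controls how many times K-Means is run with different random centroid seeds, and the run that produces the lowest cost (exposed as the inertia_ attribute) is kept. A minimal sketch, with illustrative values:

>>> import numpy as np
>>> from sklearn.cluster import KMeans

>>> X = np.random.rand(100, 2)  # illustrative unlabeled data

>>> # Run K-Means ten times from random starting centroids and keep the best run.
>>> kmeans = KMeans(n_clusters=2, init='random', n_init=10).fit(X)
>>> kmeans.inertia_  # the cost of the best of the ten runs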

The elbow method

If K is not specified by the problem's context, the optimal number of clusters can be estimated using a technique called the elbow method. The elbow method plots the value of the cost function produced by different values of K. As K increases, the average distortion will decrease; each cluster will have fewer constituent instances, and the instances will be closer to their respective centroids. However, the improvements to the average distortion will decline as K increases. The value of K at which the improvement to the distortion declines the most is called the elbow. Let's use the elbow method to choose the number of clusters for a dataset. The following scatter plot visualizes a dataset with two obvious clusters:

[Figure: scatter plot of a dataset with two obvious clusters]

We will calculate and plot the mean distortion of the clusters for each value of K from 1 to 10 with the following code:

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from scipy.spatial.distance import cdist
>>> import matplotlib.pyplot as plt

>>> # Generate two well-separated clusters of ten points each.
>>> cluster1 = np.random.uniform(0.5, 1.5, (2, 10))
>>> cluster2 = np.random.uniform(3.5, 4.5, (2, 10))
>>> X = np.hstack((cluster1, cluster2)).T

>>> # Fit K-Means for each value of k and record the average distortion.
>>> K = range(1, 11)
>>> meandistortions = []
>>> for k in K:
...     kmeans = KMeans(n_clusters=k)
...     kmeans.fit(X)
...     meandistortions.append(
...         sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

>>> plt.plot(K, meandistortions, 'bx-')
>>> plt.xlabel('k')
>>> plt.ylabel('Average distortion')
>>> plt.title('Selecting k with the Elbow Method')
>>> plt.show()
[Figure: elbow plot of average distortion for k from 1 to 10]

The average distortion improves rapidly as we increase K from 1 to 2. There is little improvement for values of K greater than 2. Now let's use the elbow method on the following dataset with three clusters:

[Figure: scatter plot of a dataset with three clusters]

The following figure shows the elbow plot for the dataset. From this, we can see that the rate of improvement to the average distortion declines the most when adding a fourth cluster; that is, the elbow method confirms that K should be set to three for this dataset.

[Figure: elbow plot for the three-cluster dataset]