We defined machine learning as the design and study of systems that learn from experience to improve their performance of a task as measured by a given metric. K-Means is an unsupervised learning algorithm; there are no labels or ground truth to compare with the clusters. However, we can still evaluate the performance of the algorithm using intrinsic measures. We have already discussed measuring the distortions of the clusters. In this section, we will discuss another performance measure for clustering called the silhouette coefficient. The silhouette coefficient is a measure of the compactness and separation of the clusters. It increases as the quality of the clusters increases; it is large for compact clusters that are far from each other and small for large, overlapping clusters. The silhouette coefficient is calculated per instance; for a set of instances, it is calculated as the mean of the individual samples' scores. The silhouette coefficient for an instance is calculated with the following equation:

s = (b - a) / max(a, b)
a is the mean distance between the instance and the other instances in its cluster. b is the mean distance between the instance and the instances in the next closest cluster. The following example runs K-Means five times to create two, three, four, five, and eight clusters from a toy dataset and calculates the silhouette coefficient for each run:
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from sklearn import metrics
>>> import matplotlib.pyplot as plt
>>> plt.subplot(3, 2, 1)
>>> x1 = np.array([1, 2, 3, 1, 5, 6, 5, 5, 6, 7, 8, 9, 7, 9])
>>> x2 = np.array([1, 3, 2, 2, 8, 6, 7, 6, 7, 1, 2, 1, 1, 3])
>>> X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)
>>> plt.xlim([0, 10])
>>> plt.ylim([0, 10])
>>> plt.title('Instances')
>>> plt.scatter(x1, x2)
>>> colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'b']
>>> markers = ['o', 's', 'D', 'v', '^', 'p', '*', '+']
>>> tests = [2, 3, 4, 5, 8]
>>> subplot_counter = 1
>>> for t in tests:
...     subplot_counter += 1
...     plt.subplot(3, 2, subplot_counter)
...     kmeans_model = KMeans(n_clusters=t).fit(X)
...     for i, l in enumerate(kmeans_model.labels_):
...         plt.plot(x1[i], x2[i], color=colors[l], marker=markers[l], ls='None')
...     plt.xlim([0, 10])
...     plt.ylim([0, 10])
...     plt.title('K = %s, silhouette coefficient = %.03f' % (
...         t, metrics.silhouette_score(X, kmeans_model.labels_, metric='euclidean')))
>>> plt.show()
This script produces the following figure:
The dataset contains three obvious clusters. Accordingly, the silhouette coefficient is greatest when K is equal to three. Setting K equal to eight produces clusters of instances that are as close to each other as they are to the instances in some of the other clusters, and the silhouette coefficients of these clusters are smallest.
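To make the formula concrete, the following sketch (not from the original example) computes the silhouette coefficient by hand for a tiny dataset of two well-separated clusters and compares the result with scikit-learn's `metrics.silhouette_score`. The data, labels, and helper function `silhouette_sample` are illustrative assumptions:

```python
import numpy as np
from sklearn import metrics

# Two well-separated clusters of two points each (made-up data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_sample(i):
    same = X[labels == labels[i]]
    other = X[labels != labels[i]]
    # a: mean distance to the other instances in the same cluster
    a = np.mean([np.linalg.norm(X[i] - p)
                 for p in same if not np.array_equal(p, X[i])])
    # b: mean distance to the instances in the next closest cluster
    # (with only two clusters, that is simply the other cluster)
    b = np.mean([np.linalg.norm(X[i] - p) for p in other])
    return (b - a) / max(a, b)

manual = np.mean([silhouette_sample(i) for i in range(len(X))])
print(manual)
print(metrics.silhouette_score(X, labels, metric='euclidean'))
```

Because the clusters are compact and far apart, both values are close to 1, and the manual calculation agrees with the library's result.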