K-means clustering

As with hierarchical clustering, we can use NbClust() to determine the optimal number of clusters for k-means. All you need to do is specify kmeans as the method in the function. Let's also raise the maximum number of clusters to 15. I've abbreviated the following output to just the majority rule portion:

    > numKMeans <- NbClust(df, min.nc = 2, max.nc = 15, method = "kmeans")

    * Among all indices:
    * 4 proposed 2 as the best number of clusters
    * 15 proposed 3 as the best number of clusters
    * 1 proposed 10 as the best number of clusters
    * 1 proposed 12 as the best number of clusters
    * 1 proposed 14 as the best number of clusters
    * 1 proposed 15 as the best number of clusters

    ***** Conclusion *****

    * According to the majority rule, the best number of clusters is 3
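The majority rule is simply a vote count across the indices. As a minimal sketch, the tally below re-enters the vote counts from the abbreviated output by hand and picks the winner. The `votes` vector is an illustrative name, not part of NbClust. With the fitted object available, the per-index proposals should be in the first row of `numKMeans$Best.nc`, though that is worth confirming against the NbClust documentation.

```r
# Votes copied by hand from the abbreviated NbClust() output:
# names are the proposed number of clusters, values are the number
# of indices that proposed it. `votes` is a hypothetical helper name.
votes <- c("2" = 4, "3" = 15, "10" = 1, "12" = 1, "14" = 1, "15" = 1)

total_indices <- sum(votes)                    # indices that cast a vote
best_k <- as.integer(names(which.max(votes)))  # cluster count with most votes
share <- max(votes) / total_indices            # fraction backing the winner

best_k
round(share, 2)
```

Here 15 of the 23 voting indices back three clusters, so the majority is clear, but it is not unanimous.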

Once again, three clusters appear to be the optimum solution. Here is the Hubert plot, which confirms this:

In R, we can use the kmeans() function to do this analysis. In addition to the input data, we have to specify the number of clusters we are solving for and the number of random starting configurations to try, which is the nstart argument. Because the starting centers are chosen at random, we also need to set a random seed so that the results are reproducible:

    > set.seed(1234)

    > km <- kmeans(df, 3, nstart = 25)
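To see why nstart matters, here is a small sketch on simulated data. The three Gaussian blobs and the object names `x`, `km_single`, and `km_multi` are assumptions made for illustration, not the wine data. It compares a single random start with the best of 25 starts, using the total within-cluster sum of squares as the yardstick.

```r
# Simulated stand-in data (assumption): three 2-D Gaussian blobs of 50 points
set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)

set.seed(1234)
km_single <- kmeans(x, centers = 3, nstart = 1)   # one random start

set.seed(1234)
km_multi <- kmeans(x, centers = 3, nstart = 25)   # best of 25 random starts

# Total within-cluster sum of squares: lower means tighter clusters
c(single = km_single$tot.withinss, multi = km_multi$tot.withinss)
```

A single start can settle in a poor local optimum. Keeping the best of several starts makes that much less likely, at the cost of extra computation.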

Creating a table of the clusters gives us a sense of the distribution of the observations between them:

    > table(km$cluster)

     1  2  3
    62 65 51

The observations are well-balanced across the clusters. I have seen on a number of occasions, with larger datasets and many more variables, that no choice of k yields a promising and compelling k-means result. Another way to analyze the clustering is to look at a matrix of the cluster centers for each variable in each cluster:

    > km$centers
         Alcohol  MalicAcid        Ash    Alk_ash   magnesium   T_phenols
    1  0.8328826 -0.3029551  0.3636801 -0.6084749  0.57596208  0.88274724
    2 -0.9234669 -0.3929331 -0.4931257  0.1701220 -0.49032869 -0.07576891
    3  0.1644436  0.8690954  0.1863726  0.5228924 -0.07526047 -0.97657548
       Flavanoids    Non_flav    Proantho C_Intensity        Hue  OD280_315
    1  0.97506900 -0.56050853  0.57865427   0.1705823  0.4726504  0.7770551
    2  0.02075402 -0.03343924  0.05810161  -0.8993770  0.4605046  0.2700025
    3 -1.21182921  0.72402116 -0.77751312   0.9388902 -1.1615122 -1.2887761
         Proline
    1  1.1220202
    2 -0.7517257
    3 -0.4059428
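Each row of the centers matrix is the mean of the observations assigned to that cluster. The sketch below checks this by recomputing the means with aggregate(). It uses the standardized numeric columns of the built-in iris data as a stand-in (an assumption, since the wine data frame is not reproduced here), and `manual_centers` and `max_diff` are illustrative names.

```r
# Stand-in data (assumption): four numeric iris columns, standardized
x <- as.data.frame(scale(iris[, 1:4]))

set.seed(1234)
km_demo <- kmeans(x, centers = 3, nstart = 25)

# Per-cluster column means, computed independently of kmeans()
manual_centers <- aggregate(x, by = list(cluster = km_demo$cluster), FUN = mean)

# Drop the grouping column and compare with the reported centers
max_diff <- max(abs(as.matrix(manual_centers[, -1]) - km_demo$centers))
max_diff
```

The difference should be zero up to floating-point error. If the input was standardized, each center also reads directly as a number of standard deviations above or below the overall mean.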

Note that cluster one has, on average, the highest alcohol content of the three (a center of about 0.83, versus 0.16 and -0.92). Let's produce a boxplot to look at the distribution of alcohol content in the same manner as we did before and also compare it to Ward's:

    > boxplot(wine$Alcohol ~ km$cluster, data = wine,
        main = "Alcohol Content, K-Means")

    > boxplot(wine$Alcohol ~ ward3, data = wine,
        main = "Alcohol Content, Ward's")
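A visual comparison like this can be backed up with numbers. The following sketch again uses standardized iris columns as a stand-in and assumes Ward's method via hclust() with "ward.D2" linkage, which may differ from how ward3 was built earlier. It computes per-cluster medians of one variable under each method and cross-tabulates the two sets of labels.

```r
# Stand-in data (assumption): four numeric iris columns, standardized
x <- scale(iris[, 1:4])

set.seed(1234)
km_cl <- kmeans(x, centers = 3, nstart = 25)$cluster

# Ward's linkage on Euclidean distances, cut into three clusters
ward_cl <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)

# Median of one variable per cluster under each method
km_medians   <- tapply(iris$Sepal.Length, km_cl, median)
ward_medians <- tapply(iris$Sepal.Length, ward_cl, median)

# Cross-tabulate the two labelings. Cluster numbers are arbitrary, so
# agreement would show up as one dominant cell per row, not necessarily
# on the diagonal.
agreement <- table(kmeans = km_cl, ward = ward_cl)
agreement
```

Because cluster numbering is arbitrary in both methods, the medians should be matched by value rather than by cluster label.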

The alcohol content for each cluster is almost exactly the same under both methods. On the surface, this tells me that three clusters is the proper latent structure for the wines and that there is little difference between using k-means and hierarchical clustering. Finally, let's compare the k-means clusters with the cultivars:

    > table(km$cluster, wine$Class)

         1  2  3
      1 59  3  0
      2  0 65  0
      3  0  3 48
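The cross-tabulation can be reduced to a single agreement figure. The sketch below re-enters the table by hand and treats the diagonal as matches. That works here only because the cluster numbers happen to line up with the cultivar numbers; in general the labels are arbitrary and would need to be matched first. The object names are illustrative.

```r
# Table copied from the output above
# (rows: k-means cluster, columns: cultivar)
confusion <- matrix(
  c(59,  3,  0,
     0, 65,  0,
     0,  3, 48),
  nrow = 3, byrow = TRUE,
  dimnames = list(cluster = c("1", "2", "3"), cultivar = c("1", "2", "3"))
)

matched <- sum(diag(confusion))         # cluster number equals cultivar number
agreement <- matched / sum(confusion)   # 172 / 178, about 0.966

# Share of each cultivar that landed in its matching cluster
per_cultivar <- diag(confusion) / colSums(confusion)

round(agreement, 3)
round(per_cultivar, 3)
```

All six mismatches involve cultivar 2, split evenly between clusters one and three.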

This is very similar to the distribution produced by Ward's method, and either one would probably be acceptable to our hypothetical sommelier.

However, to demonstrate how you can cluster on data with both numeric and non-numeric values, let's work through some more examples.

