Determining the number of optimal clusters

The beauty of clustering algorithms such as K-means is that they do the clustering on the data with an unlimited number of features. They are great tools to use when you have raw data and would like to know the patterns in that data. However, deciding on the number of clusters prior doing the experiment might not be successful and may sometimes lead to an overfitting or underfitting problem.

On the other hand, one common thing to all three algorithms (that is, K-means, bisecting K-means, and Gaussian mixture) is that the number of clusters must be determined in advance and supplied to the algorithm as a parameter. Hence, informally, determining the number of clusters is a separate optimization problem to be solved.

Now we will use a heuristic approach based on the Elbow method. We start from K = 2 clusters, then we run the K-means algorithm for the same dataset by increasing K and observing the value of the cost function WCSS:

val iterations = 20
for (i <- 2 to iterations) {
val kmeans = new KMeans().setK(i).setSeed(12345L)
val model = kmeans.fit(pcaDF)
val WSSSE = model.computeCost(pcaDF)
println("Within-Cluster Sum of Squares for k = " + i + " is " +
WSSSE)
}

At some point, a big drop in cost function can be observed, but then the improvement became marginal with the increasing value of k. As suggested in cluster analysis literature, we can pick the k after the last big drop of WCSS as an optimal one. Now, let's see the WCSS values for a different number of clusters between 2 and 20 for example:

Within-Cluster Sum of Squares for k = 2 is 453.161838161838
Within-Cluster Sum of Squares for k = 3 is 438.2392344497606
Within-Cluster Sum of Squares for k = 4 is 390.2278787878787
Within-Cluster Sum of Squares for k = 5 is 397.72112098427874
Within-Cluster Sum of Squares for k = 6 is 367.8890909090908
Within-Cluster Sum of Squares for k = 7 is 362.3360347662672
Within-Cluster Sum of Squares for k = 8 is 347.49306362861336
Within-Cluster Sum of Squares for k = 9 is 327.5002901103624
Within-Cluster Sum of Squares for k = 10 is 327.29376873556436
Within-Cluster Sum of Squares for k = 11 is 315.2954156954155
Within-Cluster Sum of Squares for k = 12 is 320.2478696814693
Within-Cluster Sum of Squares for k = 13 is 308.7674242424241
Within-Cluster Sum of Squares for k = 14 is 314.64784054938576
Within-Cluster Sum of Squares for k = 15 is 297.38523698523704
Within-Cluster Sum of Squares for k = 16 is 294.26114718614707
Within-Cluster Sum of Squares for k = 17 is 284.34890572390555
Within-Cluster Sum of Squares for k = 18 is 280.35662525879917
Within-Cluster Sum of Squares for k = 19 is 272.765762015762
Within-Cluster Sum of Squares for k = 20 is 272.05702362771336

Now let us discuss how to take advantage of the Elbow method for determining the number of clusters. As shown next, we calculated the cost function, WCSS, as a function of a number of clusters for the K-means algorithm applied to Y chromosome genetic variants from the selected population groups.

It can be observed that a somewhat big drop occurs when k = 9 (which is not a drastic drop though). Therefore, we choose the number of clusters to be 10, as shown in Figure 10:

Figure 19: Number of clusters as a function of WCSS
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset