How does K-means work?

A clustering algorithm, such as K-means, locates the centroid of each group of data points. To make the clustering accurate and effective, the algorithm evaluates the distance between each point and the centroid of its cluster.
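To make the idea of a centroid and point-to-centroid distances concrete, here is a minimal sketch in plain Python. The helper names `centroid` and `euclidean` and the sample points are illustrative assumptions, not part of the Spark, ADAM, or H2O APIs.

```python
import math

def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def euclidean(p, q):
    """Return the Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical 2-D data points forming a single group
points = [(1.0, 1.0), (2.0, 1.0), (3.0, 4.0)]

c = centroid(points)                           # center of the group
distances = [euclidean(p, c) for p in points]  # distance of each point from it
```

The smaller these distances are on average, the more compact the group is around its centroid.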

Ultimately, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. For example, the K-means algorithm tries to group related data points into a predefined number of clusters (here, k = 3), as shown in Figure 8:

Figure 8: The results of a typical clustering algorithm and a representation of the cluster centers

In our case, a combined approach using Spark, ADAM, and H2O is capable of processing large amounts of variant data points. Suppose we have n data points (x_i, i = 1, 2, ..., n; for example, genetic variants) that need to be partitioned into k clusters. K-means assigns each data point to a cluster, aiming to find the positions μ_i, i = 1, ..., k, of the cluster centers that minimize the distance from the data points to their cluster. Mathematically, K-means tries to achieve this goal by solving the following optimization problem:

$$\underset{c}{\arg\min} \sum_{i=1}^{k} \sum_{x \in c_i} d(x, \mu_i) = \underset{c}{\arg\min} \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - \mu_i \rVert_2^2$$

In the preceding equation, c_i is the set of data points assigned to cluster i, and d(x, μ_i) = ‖x − μ_i‖₂² is the squared Euclidean distance between a point x and the cluster center μ_i. The algorithm computes this distance between the data points and the centers of the k clusters, and minimizes the resulting Within-Cluster Sum of Squares (WCSS).
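As a rough illustration of the quantity being minimized, the following sketch computes the WCSS for a fixed assignment of points to clusters. The function names and the toy data are assumptions made for this example only.

```python
def squared_distance(x, mu):
    """Squared Euclidean distance between point x and center mu."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def wcss(clusters, centers):
    """Sum of squared distances of every point to its own cluster center.

    clusters[i] holds the points assigned to cluster i,
    centers[i] holds the center of cluster i.
    """
    return sum(
        squared_distance(x, mu)
        for pts, mu in zip(clusters, centers)
        for x in pts
    )

# Hypothetical assignment: two clusters with two points each
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 10.0), (12.0, 10.0)],
]
centers = [(0.0, 1.0), (11.0, 10.0)]

total = wcss(clusters, centers)  # each point contributes 1.0, so total is 4.0
```

Moving a center away from the mean of its points, or assigning a point to a more distant center, can only increase this value, which is why K-means alternates between re-assigning points and re-computing means.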

Therefore, the overall clustering operation using K-means is not a trivial one: finding the optimal solution is an NP-hard optimization problem. This also means that the K-means algorithm tries to find the global minimum but often gets stuck in a local minimum, so different runs can end in different solutions. The K-means algorithm proceeds by alternating between two steps:

  • Cluster assignment step: Assign each observation to the cluster whose mean yields the least WCSS. Because the sum of squares is the squared Euclidean distance, this is the cluster with the nearest mean.
  • Centroid update step: Calculate the new means to be the centroids of the observations in the new clusters.
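The two alternating steps can be sketched in a few lines of plain Python. This is only an illustrative, single-machine version under stated assumptions: the function name `kmeans`, the random choice of k data points as initial centers, the `max_iter` and `seed` parameters, and the sample data are all hypothetical, and it is not the distributed implementation used with Spark or H2O.

```python
import random

def squared_distance(x, mu):
    """Squared Euclidean distance between point x and center mu."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Assumption: initialize the centers with k distinct data points
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Cluster assignment step: pick the nearest center for each point
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(x, centers[i]))
            for x in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Centroid update step: move each center to the mean of its points
        for i in range(k):
            members = [x for x, a in zip(points, assignment) if a == i]
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = mean(members)
    return centers, assignment

# Hypothetical data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, assignment = kmeans(points, k=2)
```

Because the result depends on the initial centers, a common practice is to run the procedure several times with different seeds and keep the run with the lowest WCSS.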

In a nutshell, the overall approach of K-means training is shown in the following figure:

Figure 9: Overall approach of the K-means algorithm process