The K-means algorithm

The K-means algorithm is a clustering algorithm designed in 1967 by MacQueen that divides a set of objects into K partitions based on their attributes. It is a variation of the expectation-maximization (EM) algorithm, whose goal is to determine the K data groups generated by Gaussian distributions. The two differ in how they measure the distance between data items: K-means uses the Euclidean distance, whereas EM uses statistical methods.

In K-means, it is assumed that object attributes can be represented as vectors and thus form a vector space. The goal is to minimize the total intra-cluster variance, that is, the sum of the squared distances between each object and the centroid of its cluster. Each cluster is identified by its centroid.
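The quantity being minimized is commonly written as a within-cluster sum of squares. The symbols below are notation introduced here for illustration and do not come from the text: S_i denotes the set of objects in cluster i, and mu_i denotes its centroid.

```latex
J = \sum_{i=1}^{K} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x
```

Under this notation, each centroid is simply the mean of the objects assigned to its cluster.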

The algorithm follows an iterative procedure:

  1. Choose the number of clusters K.
  2. Create K initial partitions and assign each entry to a partition, either randomly or using some heuristic information.
  3. Calculate the centroid of each group.
  4. Calculate the distance between each observation and each cluster centroid.
  5. Construct a new partition by associating each entry point with the cluster whose centroid is closest to it.
  6. Recalculate the centroid of each new cluster.
  7. Repeat steps 4 through 6 until the algorithm converges, that is, until the assignments no longer change.
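The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names (`centroid`, `nearest`, `k_means`), the `max_iter` and `seed` parameters, and the handling of empty clusters are all assumptions made for this example, and points are assumed to be tuples of numbers of equal length.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def nearest(point, centroids):
    """Index of the centroid with the smallest Euclidean distance to point."""
    return min(range(len(centroids)),
               key=lambda j: math.dist(point, centroids[j]))


def k_means(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) for a list of points and a cluster count k."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)

    # Step 2: random initial partition. The first k shuffled points seed one
    # cluster each, so no cluster starts empty (a choice made for this sketch).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank if rank < k else rng.randrange(k)

    # Step 3: centroid of each initial group.
    centroids = [
        centroid([p for p, lab in zip(points, labels) if lab == j])
        for j in range(k)
    ]

    for _ in range(max_iter):
        # Steps 4 and 5: assign every point to its closest centroid.
        new_labels = [nearest(p, centroids) for p in points]

        # Step 6: recompute centroids. An empty cluster keeps its previous
        # centroid (an assumption; other strategies exist).
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centroids[j] = centroid(members)

        # Step 7: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels

    return labels, centroids
```

For example, `k_means([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)` groups the first three points into one cluster and the last three into the other. Which cluster receives index 0 depends on the random initial partition.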

The purpose of the algorithm is to locate k centroids, one for each cluster. The initial position of each centroid matters, because different starting positions can lead to different results. A good heuristic is to place them as far apart from each other as possible. Each object is then associated with the nearest centroid, which gives a first grouping. After this first cycle, k new centroids are computed as the barycentres of the clusters resulting from the previous step. Once these new centroids are located, each point of the same dataset is reassigned to its new closest centroid, and a new cycle begins. Over successive cycles, the k centroids change position step by step until no further changes occur, that is, until the centroids no longer move. The following figure shows the k centroids of a data distribution:

Figure 6.10: k centroids of the data distribution
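One way to act on the advice to place the starting centroids far apart is a greedy "farthest-first" selection. The text does not prescribe this method. The sketch below, including the function name `farthest_point_init` and its `first` parameter, is a hypothetical illustration. Note that a heuristic like this tends to pick outliers, so it may not suit every dataset.

```python
import math


def farthest_point_init(points, k, first=0):
    """Pick k starting centroids, each as far as possible from those chosen."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # Start from an arbitrary point (index `first` is an assumption).
    chosen = [points[first]]

    # min_dist[i] is the distance from points[i] to its closest chosen centroid.
    min_dist = [math.dist(p, chosen[0]) for p in points]

    while len(chosen) < k:
        # Take the point that is farthest from every centroid chosen so far.
        idx = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(points[idx])
        # Update each point's distance to its closest chosen centroid.
        min_dist = [
            min(d, math.dist(p, points[idx]))
            for p, d in zip(points, min_dist)
        ]

    return chosen
```

The returned points could then serve as the initial centroids in place of a random partition.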