K-means clustering

With k-means, we need to specify the exact number of clusters that we want. The algorithm then iterates, assigning each observation to exactly one of the k clusters, until the assignments stop changing. Its goal is to minimize the within-cluster variation, defined by squared Euclidean distances. So, the variation of the kth cluster is the sum of the squared Euclidean distances between all pairs of observations in the cluster, divided by the number of observations in that cluster.
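The definition above can be written out as a formula. The notation here is an assumption made for illustration and does not come from the text: C_k is the set of observations in the kth cluster, |C_k| is its size, p is the number of features, and x_{ij} is the value of feature j for observation i.

```latex
W(C_k) = \frac{1}{|C_k|} \sum_{i,\, i' \in C_k} \sum_{j=1}^{p} \left( x_{ij} - x_{i'j} \right)^2,
\qquad
\min_{C_1, \dots, C_K} \; \sum_{k=1}^{K} W(C_k)
```

A standard identity rewrites the pairwise sum as twice the sum of squared distances to the cluster mean, W(C_k) = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2. This is why moving each center to the mean of its members never increases the objective.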

Because the starting means are chosen at random and the iterations can settle on a local rather than the global optimum, one k-means result can differ greatly from another even if you specify the same number of clusters. Let's see how this algorithm plays out:

  1. Specify the exact number of clusters you desire (k)
  2. Initialize: k observations are randomly selected as the initial means
  3. Iterate:
    • K clusters are created by assigning each observation to its closest cluster center (minimizing within-cluster sum of squares)
    • The centroid of each cluster becomes the new mean
    • This is repeated until convergence, that is, the cluster centroids do not change
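The steps above can be sketched in a few lines of code. This is a minimal illustration in plain Python, not the implementation that R uses. The function names, the `seed` and `max_iter` parameters, and the toy data are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Label each point with the index of its closest center."""
    return [min(range(len(centers)),
                key=lambda c: squared_distance(p, centers[c]))
            for p in points]

def kmeans(points, k, seed=0, max_iter=100):
    """One k-means run; returns (centers, labels, total within-cluster SS)."""
    rng = random.Random(seed)
    # Step 2: k randomly selected observations become the initial means.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Step 3a: assign each observation to its closest center.
        labels = assign(points, centers)
        # Step 3b: the centroid of each cluster becomes the new mean.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # empty cluster: keep old center
        # Step 3c: stop once the centroids no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    wss = sum(squared_distance(p, centers[lab])
              for p, lab in zip(points, labels))
    return centers, labels, wss

# Toy data (made up): two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels, wss = kmeans(points, k=2)
```

The returned `wss` is the total squared distance of each point to its own center, the quantity the assignment step tries to reduce.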

As you can see, the final result will vary because of the random initialization in step 2. Therefore, it is important to run multiple initial starts and let the software identify the best solution. In R, this can be a simple process, as we will see.
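To make the idea of multiple starts concrete, here is a hedged sketch in plain Python that runs k-means several times on one-dimensional toy data and keeps the run with the smallest total within-cluster sum of squares. The names `kmeans_1d`, `best_of`, and `n_starts` and the data are hypothetical. In R, the built-in `kmeans` function exposes the same idea through its `nstart` argument.

```python
import random

def kmeans_1d(values, k, rng, max_iter=100):
    """One k-means run on 1-D data; returns (sorted centers, total within-cluster SS)."""
    centers = rng.sample(values, k)  # random initial means
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: (v - centers[c]) ** 2)
            groups[nearest].append(v)
        # New mean per cluster; an empty cluster keeps its old center.
        new_centers = [sum(g) / len(g) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    wss = sum(min((v - c) ** 2 for c in centers) for v in values)
    return sorted(centers), wss

def best_of(values, k, n_starts=25, seed=0):
    """Run k-means n_starts times; return (best run, all runs)."""
    rng = random.Random(seed)
    runs = [kmeans_1d(values, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[1]), runs

# Toy data (made up): three tight groups around 2, 12 and 22.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0, 21.0, 22.0, 23.0]
(best_centers, best_wss), runs = best_of(values, k=3)
```

A start whose three initial means all fall in the same group can converge to a poor local solution, for example by splitting one group in two and merging the other two. Keeping the lowest-scoring run guards against that.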
