k-means clustering algorithm on household income example

We will apply the k-means clustering algorithm to the household income example. In the beginning, we have households with the incomes 40k, 55k, 70k, 100k, 115k, 130k and 135k in USD.

The first centroid can be any feature, for example 70k. The second centroid should be the feature that is furthest from the first one. That is 135k, since 135k-70k=65k is the greatest difference between 70k and any other feature. Thus 70k is the centroid of the first cluster and 135k is the centroid of the second cluster.
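The initialization described above can be sketched in a few lines of Python. This is only an illustration: the variable names are our own, and incomes are written in thousands of USD.

```python
# Incomes in thousands of USD, taken from the example.
incomes = [40, 55, 70, 100, 115, 130, 135]

# Assumption: the first centroid is chosen by hand, as in the text.
first_centroid = 70

# The second centroid is the feature with the greatest distance from the first.
second_centroid = max(incomes, key=lambda x: abs(x - first_centroid))

print(first_centroid, second_centroid)  # 70 135
```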

Comparing the absolute differences, the features 40k, 55k, 70k and 100k are closer to 70k than to 135k, so they will be in the first cluster. For instance, 100k is at distance 30k from 70k but at distance 35k from 135k. The features 115k, 130k and 135k are closer to 135k than to 70k, so they will be in the second cluster.
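The assignment step can be written as a small helper. The function name assign is hypothetical, and breaking ties in favor of the lower-numbered cluster is our own assumption, since no ties occur in this example.

```python
incomes = [40, 55, 70, 100, 115, 130, 135]  # thousands of USD
centroids = [70, 135]                       # initial centroids from the text

def assign(points, centroids):
    """Group each point under the index of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        # min() returns the first index on a tie (an assumption, not from the text).
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    return clusters

clusters = assign(incomes, centroids)
print(clusters)  # [[40, 55, 70, 100], [115, 130, 135]]
```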

After we have classified the features according to the initial centroids, we recompute the centroids. The centroid of the first cluster is (1/4)*(40k+55k+70k+100k)=(1/4)*265k=66.25k.

The centroid of the second cluster is (1/3)*(115k+130k+135k)=(1/3)*380k~126.66k.
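Recomputing a centroid is just taking the mean of the features in the cluster. A minimal sketch, using the standard library's statistics.mean and our own variable names:

```python
from statistics import mean

# Clusters after the first assignment, in thousands of USD.
clusters = [[40, 55, 70, 100], [115, 130, 135]]

# Each new centroid is the arithmetic mean of its cluster.
centroids = [mean(c) for c in clusters]

print([round(c, 2) for c in centroids])
```

Note that the second centroid is 380/3 = 126.666..., so rounding to two decimals gives 126.67, while the text writes the same value truncated as 126.66k.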

Using the new centroids we reclassify the features as follows:

  • The first cluster with the centroid 66.25k will contain the features 40k, 55k, 70k.
  • The second cluster with the centroid 126.66k will contain the features 100k, 115k, 130k, 135k.

We notice that the feature 100k moved from the first cluster into the second since now it is closer to the centroid of the second cluster (distance |100k-126.66k|=26.66k) than to the centroid of the first cluster (distance |100k-66.25k|=33.75k). Since the features in the clusters changed, we have to recompute the centroids again.
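The comparison that moves 100k into the second cluster can be checked directly. This is only a quick sketch; the names are illustrative.

```python
centroids = [66.25, 380 / 3]  # recomputed centroids, thousands of USD
point = 100

# Distance from the point to each centroid.
distances = [abs(point - c) for c in centroids]

# Index 0 is the first cluster, index 1 the second.
nearest = distances.index(min(distances))

print(distances[0], round(distances[1], 2), nearest)
```

The distance to the first centroid is 33.75 and the distance to the second is about 26.67, so the nearest index is 1, that is, the second cluster.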

The centroid of the first cluster is (1/3)*(40k+55k+70k)=(1/3)*165k=55k. The centroid of the second cluster is (1/4)*(100k+115k+130k+135k)=(1/4)*480k=120k.

Using these centroids, we reclassify the features into the clusters. The first cluster, with the centroid 55k, will contain the features 40k, 55k, 70k. The second cluster, with the centroid 120k, will contain the features 100k, 115k, 130k, 135k. Thus, after the update of the centroids, the clusters did not change, so their centroids will remain the same.

Therefore the algorithm terminates with the two clusters: the first cluster having the features 40k, 55k, 70k; the second cluster having the features 100k, 115k, 130k, 135k.
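Putting the steps together, the whole walk-through can be reproduced with a short loop. This is a minimal one-dimensional sketch under stated assumptions, not a general implementation: the function name k_means_1d is hypothetical, the initial centroids are passed in by hand, and no cluster is assumed to become empty.

```python
def k_means_1d(points, centroids):
    """Repeat assignment and centroid updates until the clusters stop changing."""
    previous = None
    while True:
        # Assignment step: put each point into the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Stop once an iteration leaves the clusters unchanged.
        if clusters == previous:
            return clusters, centroids
        # Update step: each centroid becomes the mean of its cluster.
        # Assumption: no cluster is empty, otherwise this would divide by zero.
        centroids = [sum(c) / len(c) for c in clusters]
        previous = clusters

incomes = [40, 55, 70, 100, 115, 130, 135]  # thousands of USD
clusters, centroids = k_means_1d(incomes, [70, 135])
print(clusters)   # [[40, 55, 70], [100, 115, 130, 135]]
print(centroids)  # [55.0, 120.0]
```

The loop stops on the third assignment, when the clusters are the same as after the second one, which matches the termination described above.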
