K-means working methodology from first principles

The k-means working methodology is illustrated in the following example in which 12 instances are considered with their X and Y values. The task is to determine the optimal clusters out of the data.

Instance

X

Y

1

7

8

2

2

4

3

6

4

4

3

2

5

6

5

6

5

7

7

3

3

8

1

4

9

5

4

10

7

7

11

7

6

12

2

1

After plotting the data points on a 2D chart, we can see that roughly two clusters are possible, where below-left is the first cluster and the top-right is another cluster, but in many practical cases, there would be so many variables (or dimensions) that, we cannot simply visualize them. Hence, we need a mathematical and algorithmic way to solve these types of problems.

Iteration 1: Let us assume two centers from two instances out of all the 12 instances. Here, we have chosen instance 1 (X = 7, Y = 8) and instance 8 (X = 1, Y = 4), as they seem to be at both extremes. For each instance, we will calculate its Euclidean distances with respect to both centroids and assign it to the nearest cluster center.

Instance

X

Y

Centroid 1 distance

Centroid 2 distance

Assigned cluster

1

7

8

7.21

0.00

C2

2

2

4

1.00

6.40

C1

3

6

4

5.00

4.12

C2

4

3

2

2.83

7.21

C1

5

6

5

5.10

3.16

C2

6

5

7

5.00

2.24

C2

7

3

3

2.24

6.40

C1

8

1

4

0.00

7.21

C1

9

5

4

4.00

4.47

C1

10

7

7

6.71

1.00

C2

11

7

6

6.32

2.00

C2

12

2

1

3.16

8.60

C1

Centroid 1

1

4

Centroid 2

7

8

 

The Euclidean distance between two points A (X1, Y1) and B (X2, Y2) is shown as follows:

Centroid distance calculations are performed by taking Euclidean distances. A sample calculation has been shown as follows. For instance, six with respect to both centroids (centroid 1 and centroid 2).

The following chart describes the assignment of instances to both centroids, which was shown in the preceding table format:

If we carefully observe the preceding chart, we realize that all the instances seem to be assigned appropriately apart from instance 9 (X =5, Y = 4). However, in later stages, it should be assigned appropriately. Let us see in the below steps how the assignments evolve.

Iteration 2: In this iteration, new centroids are calculated from the assigned instances for that cluster or centroid. New centroids are calculated based on the simple average of the assigned points.

Instance

X

Y

Assigned cluster

1

7

8

C2

2

2

4

C1

3

6

4

C2

4

3

2

C1

5

6

5

C2

6

5

7

C2

7

3

3

C1

8

1

4

C1

9

5

4

C1

10

7

7

C2

11

7

6

C2

12

2

1

C1

Centroid 1

2.67

3

Centroid 2

6.33

6.17

 

Sample calculations for centroids 1 and 2 are shown as follows. A similar methodology will be applied on all subsequent iterations as well:

After updating the centroids, we need to reassign the instances to the nearest centroids, which we will be performing in iteration 3.

Iteration 3: In this iteration, new assignments are calculated based on the Euclidean distance between instances and new centroids. In the event of any changes, new centroids will be calculated iteratively until no changes in assignments are possible or the number of iterations is reached. The following table describes the distance measures between new centroids and all the instances:

Instance

X

Y

Centroid 1 distance

Centroid 2 distance

Previously assigned cluster

Newly assigned cluster

Changed?

1

7

8

6.61

1.95

C2

C2

No

2

2

4

1.20

4.84

C1

C1

No

3

6

4

3.48

2.19

C2

C2

No

4

3

2

1.05

5.34

C1

C1

No

5

6

5

3.88

1.22

C2

C2

No

6

5

7

4.63

1.57

C2

C2

No

7

3

3

0.33

4.60

C1

C1

No

8

1

4

1.95

5.75

C1

C1

No

9

5

4

2.54

2.55

C1

C1

No

10

7

7

5.89

1.07

C2

C2

No

11

7

6

5.27

0.69

C2

C2

No

12

2

1

2.11

6.74

C1

C1

No

Centroid 1

2.67

3

Centroid 2

6.33

6.17

 

It seems that there are no changes registered. Hence, we can say that the solution is converged. One important thing to note here is that all the instances are very clearly classified well, apart from instance 9 (X = 5, Y = 4). Based on instinct, it seems like it should be assigned to centroid 2, but after careful calculation, that instance is more proximate to cluster 1 than cluster 2. However, the difference in distance is low (2.54 with centroid 1 and 2.55 with centroid 2).

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset