We take the data from the gender classification problem in Chapter 2, Naive Bayes (analysis point 6):
| Height in cm | Weight in kg | Hair length | Gender |
| --- | --- | --- | --- |
| 180 | 75 | Short | Male |
| 174 | 71 | Short | Male |
| 184 | 83 | Short | Male |
| 168 | 63 | Short | Male |
| 178 | 70 | Long | Male |
| 170 | 59 | Long | Female |
| 164 | 53 | Short | Female |
| 155 | 46 | Long | Female |
| 162 | 52 | Long | Female |
| 166 | 55 | Long | Female |
| 172 | 60 | Long | ? |
To keep things simple, we remove the column Hair length. We also remove the column Gender, since we would like to cluster the people in the table based only on their height and weight. Using clustering, we would like to find out whether the 11th person in the table is more likely to be a man or a woman:
| Height in cm | Weight in kg |
| --- | --- |
| 180 | 75 |
| 174 | 71 |
| 184 | 83 |
| 168 | 63 |
| 178 | 70 |
| 170 | 59 |
| 164 | 53 |
| 155 | 46 |
| 162 | 52 |
| 166 | 55 |
| 172 | 60 |
Analysis:
We could scale the initial data, but to keep things simple we will run the algorithm on the unscaled data. We will group the data into two clusters, since there are two possible genders: male and female. We will then classify the person with height 172cm and weight 60kg as more likely a man if and only if the other people in that person's cluster are mostly men. The k-means algorithm is computationally cheap: each iteration only computes the distance from every data point to every centroid. Classifying this way therefore stays fast even when there is a large number of data points to classify.
Let us apply the k-means clustering algorithm to the data. First we pick the initial centroids. Let the first centroid be, for example, the person with height 180cm and weight 75kg, written as the vector (180,75). The point furthest away from (180,75) is (155,46), so that will be the second centroid.
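The choice of the second centroid can be checked with a short sketch. The names `points`, `euclidean`, `first_centroid` and `second_centroid` are illustrative only and do not come from the text; the data is the height and weight table above.

```python
import math

# Heights (cm) and weights (kg) of the 11 people, in table order.
points = [
    (180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59),
    (164, 53), (155, 46), (162, 52), (166, 55), (172, 60),
]

def euclidean(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

first_centroid = (180, 75)
# The second centroid is the data point furthest from the first one.
second_centroid = max(points, key=lambda p: euclidean(p, first_centroid))
print(second_centroid)  # (155, 46)
```

Picking the furthest point is only one possible initialization; it is used here because it is the rule the example follows.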
Measured by Euclidean distance, the points closer to the first centroid (180,75) are (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60), so these points form the first cluster. The points closer to the second centroid (155,46) are (155,46), (164,53), (162,52), (166,55), so these points form the second cluster. The current state of the two clusters is displayed in Image 5.1.
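As a minimal sketch of this assignment step, each point can be attached to the nearer of the two initial centroids. The helper `nearest` and the variable names are assumptions made for illustration.

```python
import math

# Heights (cm) and weights (kg) of the 11 people, in table order.
points = [
    (180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59),
    (164, 53), (155, 46), (162, 52), (166, 55), (172, 60),
]
centroids = [(180, 75), (155, 46)]  # initial centroids from the text

def nearest(p, centroids):
    """Index of the centroid with the smallest Euclidean distance to p."""
    return min(
        range(len(centroids)),
        key=lambda i: math.hypot(p[0] - centroids[i][0], p[1] - centroids[i][1]),
    )

clusters = [[], []]
for p in points:
    clusters[nearest(p, centroids)].append(p)

print(clusters[0])  # seven points, including (172, 60)
print(clusters[1])  # four points
```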
Let us recompute the centroids of the clusters. The first cluster, shown in blue, contains the points (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60) and has the centroid ((180+174+184+168+178+170+172)/7, (75+71+83+63+70+59+60)/7) ≈ (175.14, 68.71).
The second cluster, shown in red, contains the points (155,46), (164,53), (162,52), (166,55) and has the centroid ((155+164+162+166)/4, (46+53+52+55)/4) = (161.75, 51.5).
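The two centroid computations can be reproduced with a small sketch. The function name `centroid` and the variables `blue` and `red` are illustrative, not taken from the text.

```python
def centroid(cluster):
    """Component-wise mean of a non-empty list of (height, weight) points."""
    n = len(cluster)
    return (sum(h for h, _ in cluster) / n, sum(w for _, w in cluster) / n)

blue = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59), (172, 60)]
red = [(155, 46), (164, 53), (162, 52), (166, 55)]

blue_centroid = centroid(blue)
red_centroid = centroid(red)
print(tuple(round(c, 2) for c in blue_centroid))  # (175.14, 68.71)
print(red_centroid)  # (161.75, 51.5)
```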
When we reclassify the points using the new centroids, the cluster membership of the points does not change. The blue cluster still has the points (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60). The red cluster still has the points (155,46), (164,53), (162,52), (166,55). Therefore the clustering algorithm terminates with the clusters displayed in Image 5.2.
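The whole procedure, alternating assignment and centroid update until the assignments stop changing, might be sketched as follows. The function `kmeans` is a hypothetical helper written for this example, not a library call. Keeping the old centroid when a cluster becomes empty is an assumption of this sketch, and the case does not arise for this data.

```python
import math

def kmeans(points, centroids):
    """Repeat assignment and centroid update until assignments are stable."""
    assignment = None
    while True:
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(
                range(len(centroids)),
                key=lambda i: math.hypot(p[0] - centroids[i][0], p[1] - centroids[i][1]),
            )
            for p in points
        ]
        if new_assignment == assignment:
            return centroids, assignment
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                updated.append((
                    sum(h for h, _ in members) / len(members),
                    sum(w for _, w in members) / len(members),
                ))
            else:
                updated.append(old)  # assumption: keep the centroid of an empty cluster
        centroids = updated

points = [
    (180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59),
    (164, 53), (155, 46), (162, 52), (166, 55), (172, 60),
]
final_centroids, assignment = kmeans(points, [(180, 75), (155, 46)])
print(assignment)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
```

Here index 0 corresponds to the blue cluster and index 1 to the red cluster. The last entry is the person (172,60).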
Now we would like to classify the instance (172,60) as male or female. The instance (172,60) is in the blue cluster, so it is similar to the other points in the blue cluster. Are the remaining points in the blue cluster more likely male or female? Of the 6 remaining points, 5 are male and only 1 is female. Since the majority of the points in the blue cluster are male and the person (172,60) is in the blue cluster as well, we classify the person with height 172cm and weight 60kg as male.
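The majority vote over the blue cluster can be written as a short sketch. The dictionary `blue_cluster_genders` is an illustrative name; its labels are copied from the original table, and the unlabeled person (172,60) is left out of the vote.

```python
from collections import Counter

# Known genders of the blue-cluster members other than (172, 60).
blue_cluster_genders = {
    (180, 75): "Male",
    (174, 71): "Male",
    (184, 83): "Male",
    (168, 63): "Male",
    (178, 70): "Male",
    (170, 59): "Female",
}

votes = Counter(blue_cluster_genders.values())
prediction = votes.most_common(1)[0][0]
print(votes["Male"], votes["Female"])  # 5 1
print(prediction)  # Male
```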