Gender classification - clustering to classify

We take the data from the gender classification problem in Chapter 2, Naive Bayes (Analysis, point 6):

Height in cm	Weight in kg	Hair length	Gender
180	75	Short	Male
174	71	Short	Male
184	83	Short	Male
168	63	Short	Male
178	70	Long	Male
170	59	Long	Female
164	53	Short	Female
155	46	Long	Female
162	52	Long	Female
166	55	Long	Female
172	60	Long	?

To keep things simple, we remove the Hair length column. We also remove the Gender column, since we would like to cluster the people in the table based only on their height and weight. Using clustering, we would like to find out whether the 11th person in the table is more likely to be a man or a woman:

Height in cm	Weight in kg
180	75
174	71
184	83
168	63
178	70
170	59
164	53
155	46
162	52
166	55
172	60

Analysis:

We could scale the initial data, but to keep things simple, we will use the unscaled data in the algorithm. We will cluster the data into two clusters, since there are two possible genders: male and female. Then we will classify the person with the height 172cm and the weight 60kg as more likely a man if and only if there are more men than women in that person's cluster. The k-means clustering algorithm is relatively efficient, so classifying this way is fast, which helps especially when there is a large number of data points to classify.

So let us apply the k-means clustering algorithm to the data we have. First, we pick the initial centroids. Let the first centroid be, for example, the person with the height 180cm and the weight 75kg, denoted as the vector (180,75). The point that is furthest away from (180,75) is (155,46), so that will be the second centroid.
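The choice of the second centroid can be checked with a short sketch. The variable names are illustrative assumptions, and `math.dist` requires Python 3.8 or later:

```python
import math

# Heights (cm) and weights (kg) of the 11 people, in table order.
points = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70),
          (170, 59), (164, 53), (155, 46), (162, 52), (166, 55), (172, 60)]

# The first centroid is picked by hand, as in the text.
first_centroid = (180, 75)

# The second centroid is the point with the largest Euclidean distance
# from the first centroid.
second_centroid = max(points, key=lambda p: math.dist(p, first_centroid))

print(second_centroid)  # (155, 46)
```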

Using the Euclidean distance, the points that are closer to the first centroid (180,75) are (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60). So these points will be in the first cluster. The points that are closer to the second centroid (155,46) are (155,46), (164,53), (162,52), (166,55). So these points will be in the second cluster. We display the current state of these two clusters in Image 5.1 below.

Image 5.1: Clustering of people by their height and weight
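The assignment step described above can be sketched as follows. This is a minimal illustration, not a full k-means implementation; the helper name `nearest` is an assumption made for this example:

```python
import math

# Heights (cm) and weights (kg) of the 11 people, in table order.
points = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70),
          (170, 59), (164, 53), (155, 46), (162, 52), (166, 55), (172, 60)]

# Initial centroids chosen in the text.
centroids = [(180, 75), (155, 46)]

def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

# Put every point into the cluster of its nearest centroid.
clusters = [[] for _ in centroids]
for p in points:
    clusters[nearest(p, centroids)].append(p)

print(clusters[0])  # first cluster, 7 points
print(clusters[1])  # second cluster, 4 points
```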

Let us recompute the centroids of the clusters. The blue cluster, with the points (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60), will have the centroid ((180+174+184+168+178+170+172)/7, (75+71+83+63+70+59+60)/7) ≈ (175.14, 68.71).

The red cluster, with the points (155,46), (164,53), (162,52), (166,55), will have the centroid ((155+164+162+166)/4, (46+53+52+55)/4) = (161.75, 51.5).
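The arithmetic above can be reproduced with a small sketch. The function name `centroid` is an assumption for this example; it simply averages each coordinate:

```python
def centroid(cluster):
    """Return the coordinate-wise mean of a list of (height, weight) points."""
    n = len(cluster)
    mean_height = sum(h for h, _ in cluster) / n
    mean_weight = sum(w for _, w in cluster) / n
    return (mean_height, mean_weight)

blue = [(180, 75), (174, 71), (184, 83), (168, 63),
        (178, 70), (170, 59), (172, 60)]
red = [(155, 46), (164, 53), (162, 52), (166, 55)]

blue_centroid = centroid(blue)
red_centroid = centroid(red)

print(tuple(round(c, 2) for c in blue_centroid))  # (175.14, 68.71)
print(red_centroid)                               # (161.75, 51.5)
```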

When we reclassify the points using the new centroids, the clusters of the points do not change. The blue cluster will have the points (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60). The red cluster will have the points (155,46), (164,53), (162,52), (166,55). Therefore the clustering algorithm terminates with the clusters displayed in Image 5.2:

Image 5.2: Clustering of people by their height and weight
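Putting the steps together, the whole procedure can be sketched as a loop that alternates the assignment and the centroid update until no point changes its cluster. This is a minimal illustration under the assumption that no cluster ever becomes empty (true for this data); the function name `k_means` is chosen for this example:

```python
import math

def k_means(points, centroids):
    """Alternate assignment and centroid update until the assignment is stable.

    Assumes every cluster keeps at least one point.
    """
    assignment = None
    while True:
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignment == assignment:
            return assignment, centroids
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
        centroids = updated

# Heights (cm) and weights (kg) of the 11 people, in table order.
points = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70),
          (170, 59), (164, 53), (155, 46), (162, 52), (166, 55), (172, 60)]

assignment, centroids = k_means(points, [(180, 75), (155, 46)])
print(assignment)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
```

Here cluster 0 corresponds to the blue cluster and cluster 1 to the red cluster.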

Now we would like to classify the instance (172,60) as either a male or a female. The instance (172,60) is in the blue cluster, so it is similar to the other points in the blue cluster. Are the remaining points in the blue cluster more likely males or females? Of the 6 remaining points, 5 are males and only 1 is a female. Since the majority of the points in the blue cluster are males and the person (172,60) is in the blue cluster as well, we classify the person with the height 172cm and the weight 60kg as a male.
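The majority vote can be sketched as follows. The cluster indices are taken from the final clustering in the text (0 for blue, 1 for red), and the variable names are assumptions for this example:

```python
from collections import Counter

# Known genders in table order; the 11th person is unknown (None).
labels = ["Male"] * 5 + ["Female"] * 5 + [None]

# Final cluster of every person, in table order: 0 = blue, 1 = red.
assignment = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

target = 10  # index of the person (172, 60)

# Count the known genders among the other people in the target's cluster.
votes = Counter(
    label
    for label, cluster in zip(labels, assignment)
    if cluster == assignment[target] and label is not None
)

predicted = votes.most_common(1)[0][0]
print(votes["Male"], votes["Female"])  # 5 1
print(predicted)                       # Male
```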
