Gender classification - clustering to classify

We take the data from the gender classification problem in Chapter 2, Naive Bayes, Analysis point 6:

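| Height in cm | Weight in kg | Hair length | Gender |
| --- | --- | --- | --- |
| 180 | 75 | Short | Male |
| 174 | 71 | Short | Male |
| 184 | 83 | Short | Male |
| 168 | 63 | Short | Male |
| 178 | 70 | Long | Male |
| 170 | 59 | Long | Female |
| 164 | 53 | Short | Female |
| 155 | 46 | Long | Female |
| 162 | 52 | Long | Female |
| 166 | 55 | Long | Female |
| 172 | 60 | Long | ? |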

To simplify matters, we remove the Hair length column. We also remove the Gender column, since we would like to cluster the people in the table based only on their height and weight. Using clustering, we would like to find out whether the 11th person in the table is more likely to be a man or a woman:

| Height in cm | Weight in kg |
| --- | --- |
| 180 | 75 |
| 174 | 71 |
| 184 | 83 |
| 168 | 63 |
| 178 | 70 |
| 170 | 59 |
| 164 | 53 |
| 155 | 46 |
| 162 | 52 |
| 166 | 55 |
| 172 | 60 |

Analysis:

We could scale the initial data, but to keep things simple we use the unscaled data in the algorithm. We cluster the data into two clusters, since there are two possible genders: male and female. We then classify the person with height 172 cm and weight 60 kg as a man if and only if there are more men than women among the other people in that person's cluster. Clustering algorithms are generally efficient, so classifying this way is fast, especially when there is a large number of data points to classify.
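The scaling step that is skipped here could look like the following minimal sketch. It assumes min-max scaling of each column to the interval [0, 1], which is only one of several possible choices; the helper name `min_max_scale` is hypothetical and not part of the original text.

```python
# Height (cm) and weight (kg) of the 11 people from the table above.
points = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59),
          (164, 53), (155, 46), (162, 52), (166, 55), (172, 60)]

def min_max_scale(points):
    """Rescale each coordinate to [0, 1] (an assumed scaling scheme)."""
    heights = [h for h, _ in points]
    weights = [w for _, w in points]
    h_min, h_max = min(heights), max(heights)
    w_min, w_max = min(weights), max(weights)
    return [((h - h_min) / (h_max - h_min), (w - w_min) / (w_max - w_min))
            for h, w in points]

scaled = min_max_scale(points)
print(scaled[-1])  # the scaled coordinates of the person (172, 60)
```

With scaled data, height and weight would contribute comparably to the Euclidean distance. The rest of this analysis keeps the unscaled values.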

So let us apply the k-means clustering algorithm to the data. First we pick the initial centroids. Let the first centroid be, for example, the person with height 180 cm and weight 75 kg, written as the vector (180,75). The point farthest from (180,75) is (155,46), so that will be the second centroid.
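The choice of the second centroid can be checked with a short sketch. It assumes plain Euclidean distance on the unscaled data and uses `math.dist` from the Python standard library (Python 3.8 or later); the variable names are illustrative only.

```python
import math

# Height (cm) and weight (kg) of the 11 people from the table above.
points = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59),
          (164, 53), (155, 46), (162, 52), (166, 55), (172, 60)]

first_centroid = (180, 75)  # picked as an example, as in the text

# The second centroid is the data point farthest from the first centroid.
second_centroid = max(points, key=lambda p: math.dist(first_centroid, p))

print(second_centroid)  # (155, 46)
```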

Measured by Euclidean distance, the points closer to the first centroid (180,75) are (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60), so these points form the first cluster. The points closer to the second centroid (155,46) are (155,46), (164,53), (162,52), (166,55), so these points form the second cluster. The current state of these two clusters is displayed in Image 5.1 below.

Image 5.1: Clustering of people by their height and weight
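The assignment of points to the two initial centroids can be reproduced with the sketch below. The helper name `nearest_centroid_index` is hypothetical, and Euclidean distance on the unscaled data is assumed, as in the text.

```python
import math

# Height (cm) and weight (kg) of the 11 people from the table above.
points = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59),
          (164, 53), (155, 46), (162, 52), (166, 55), (172, 60)]
centroids = [(180, 75), (155, 46)]

def nearest_centroid_index(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

# clusters[i] collects the points assigned to centroids[i].
clusters = [[] for _ in centroids]
for p in points:
    clusters[nearest_centroid_index(p, centroids)].append(p)

print(clusters[0])  # the 7 points of the first cluster
print(clusters[1])  # the 4 points of the second cluster
```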

Let us recompute the centroids of the clusters. The blue cluster, with the points (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60), has the centroid ((180+174+184+168+178+170+172)/7, (75+71+83+63+70+59+60)/7) ≈ (175.14, 68.71).

The red cluster, with the points (155,46), (164,53), (162,52), (166,55), has the centroid ((155+164+162+166)/4, (46+53+52+55)/4) = (161.75, 51.5).
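The two centroid computations above can be verified with a few lines of Python. This is a minimal sketch; the function name `centroid` is an illustrative choice.

```python
def centroid(cluster):
    """Return the coordinate-wise mean of a list of (height, weight) points."""
    n = len(cluster)
    return (sum(h for h, _ in cluster) / n, sum(w for _, w in cluster) / n)

blue = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59), (172, 60)]
red = [(155, 46), (164, 53), (162, 52), (166, 55)]

blue_centroid = centroid(blue)
red_centroid = centroid(red)

print(tuple(round(c, 2) for c in blue_centroid))  # (175.14, 68.71)
print(red_centroid)                               # (161.75, 51.5)
```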

When we reclassify the points using the new centroids, the classes of the points do not change. The blue cluster still has the points (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60). The red cluster still has the points (155,46), (164,53), (162,52), (166,55). Therefore the clustering algorithm terminates with the clusters displayed in Image 5.2:

Image 5.2: Clustering of people by their height and weight
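Putting the steps together, the whole procedure can be sketched as a loop that alternates between assigning points and recomputing centroids until the assignment stops changing. This is a minimal sketch rather than a general-purpose implementation: the helper names `assign` and `mean_point` are hypothetical, and it assumes that no cluster ever becomes empty, which holds for this data.

```python
import math

# Height (cm) and weight (kg) of the 11 people from the table above.
points = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59),
          (164, 53), (155, 46), (162, 52), (166, 55), (172, 60)]

def assign(points, centroids):
    """Group the points by their nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for p in points:
        i = min(range(len(centroids)),
                key=lambda j: math.dist(p, centroids[j]))
        clusters[i].append(p)
    return clusters

def mean_point(cluster):
    """Coordinate-wise mean; assumes the cluster is not empty."""
    n = len(cluster)
    return (sum(h for h, _ in cluster) / n, sum(w for _, w in cluster) / n)

centroids = [(180, 75), (155, 46)]  # initial centroids from the text
clusters = assign(points, centroids)
while True:
    centroids = [mean_point(c) for c in clusters]
    new_clusters = assign(points, centroids)
    if new_clusters == clusters:  # no point changed cluster: terminate
        break
    clusters = new_clusters

print(clusters[0])  # blue cluster
print(clusters[1])  # red cluster
```

On this data the loop stops after the first recomputation of the centroids, in agreement with the analysis above.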

Now we would like to classify the instance (172,60) as either male or female. The instance (172,60) is in the blue cluster, so it is similar to the other points in the blue cluster. Are the remaining people in the blue cluster more likely to be male or female? Of the 6 remaining people, 5 are male and only 1 is female. Since the majority of the people in the blue cluster are male and the person (172,60) is in the blue cluster as well, we classify the person with height 172 cm and weight 60 kg as male.
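The majority vote in the blue cluster can be sketched as follows. The dictionary `known_gender` simply restates the Gender column of the first table; its name and the use of `collections.Counter` are illustrative choices, not part of the original analysis.

```python
from collections import Counter

# Known genders from the first table; the person (172, 60) has no label.
known_gender = {
    (180, 75): "Male", (174, 71): "Male", (184, 83): "Male",
    (168, 63): "Male", (178, 70): "Male", (170, 59): "Female",
    (164, 53): "Female", (155, 46): "Female", (162, 52): "Female",
    (166, 55): "Female",
}

# The cluster that contains the unlabelled person (172, 60).
blue = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59), (172, 60)]

# Count the genders of the labelled people in that cluster.
votes = Counter(known_gender[p] for p in blue if p in known_gender)
prediction = votes.most_common(1)[0][0]

print(votes["Male"], votes["Female"])  # 5 1
print(prediction)                      # Male
```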
