Let us take the house-ownership example from the first chapter.
Age | Annual income in USD | House ownership status |
23 | 50000 | non-owner |
37 | 34000 | non-owner |
48 | 40000 | owner |
52 | 30000 | non-owner |
28 | 95000 | owner |
25 | 78000 | non-owner |
35 | 130000 | owner |
32 | 105000 | owner |
20 | 100000 | non-owner |
40 | 60000 | owner |
50 | 80000 | ? (Peter)
We would like to use clustering to predict whether Peter, the last person in the table, is a house owner.
Analysis:
Just as in the first chapter, we have to scale the data: the income axis is greater by orders of magnitude and would otherwise diminish the impact of the age axis, which actually has good predictive power in this kind of problem. This is because older people have had more time to settle down, save money, and buy a house than younger ones.
We apply the same rescaling as in Chapter 1 and get the following table:
Age | Scaled age | Annual income in USD | Scaled annual income | House ownership status |
23 | 0.09375 | 50000 | 0.2 | non-owner |
37 | 0.53125 | 34000 | 0.04 | non-owner |
48 | 0.875 | 40000 | 0.1 | owner |
52 | 1 | 30000 | 0 | non-owner |
28 | 0.25 | 95000 | 0.65 | owner |
25 | 0.15625 | 78000 | 0.48 | non-owner |
35 | 0.46875 | 130000 | 1 | owner |
32 | 0.375 | 105000 | 0.75 | owner |
20 | 0 | 100000 | 0.7 | non-owner |
40 | 0.625 | 60000 | 0.3 | owner |
50 | 0.9375 | 80000 | 0.5 | ? |
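The scaled columns can be reproduced with min-max rescaling; a minimal sketch (note that Peter's row is included when computing the minima and maxima):

```python
# Min-max rescaling: each value v is mapped to (v - min) / (max - min),
# so the smallest value in a column becomes 0 and the largest becomes 1.
ages = [23, 37, 48, 52, 28, 25, 35, 32, 20, 40, 50]
incomes = [50000, 34000, 40000, 30000, 95000, 78000,
           130000, 105000, 100000, 60000, 80000]

def rescale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled_ages = rescale(ages)        # [0.09375, 0.53125, ..., 0.9375]
scaled_incomes = rescale(incomes)  # [0.2, 0.04, ..., 0.5]
```

For example, Peter's age 50 maps to (50 - 20) / (52 - 20) = 0.9375, matching the table.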
Given the table, we produce the input file for the algorithm and execute it, clustering the data into two clusters.
Input:
# source_code/5/house_ownership2.csv
0.09375,0.2
0.53125,0.04
0.875,0.1
1,0
0.25,0.65
0.15625,0.48
0.46875,1
0.375,0.75
0,0.7
0.625,0.3
0.9375,0.5
Output for two clusters:
$ python k-means_clustering.py house_ownership2.csv 2 last
The total number of steps: 3
The history of the algorithm:
Step number 0: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 0), ((0.15625, 0.48), 0), ((0.46875, 1.0), 0), ((0.375, 0.75), 0), ((0.0, 0.7), 0), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.09375, 0.2), (1.0, 0.0)]
Step number 1: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 1), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 0), ((0.15625, 0.48), 0), ((0.46875, 1.0), 0), ((0.375, 0.75), 0), ((0.0, 0.7), 0), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.26785714285714285, 0.5457142857142857), (0.859375, 0.225)]
Step number 2: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 1), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 0), ((0.15625, 0.48), 0), ((0.46875, 1.0), 0), ((0.375, 0.75), 0), ((0.0, 0.7), 0), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.22395833333333334, 0.63), (0.79375, 0.188)]
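This run can be reproduced with a minimal sketch of the k-means (Lloyd's) iteration. The initial centroids below are simply taken from the step 0 output; the book's k-means_clustering.py chooses its own starting centroids:

```python
# Minimal k-means: repeat (assign each point to its nearest centroid,
# move each centroid to the mean of its points) until assignments stop
# changing. Initial centroids are taken from the step 0 output above.
points = [(0.09375, 0.2), (0.53125, 0.04), (0.875, 0.1), (1.0, 0.0),
          (0.25, 0.65), (0.15625, 0.48), (0.46875, 1.0), (0.375, 0.75),
          (0.0, 0.7), (0.625, 0.3), (0.9375, 0.5)]
centroids = [(0.09375, 0.2), (1.0, 0.0)]

def closest(point, centroids):
    # Index of the nearest centroid by squared Euclidean distance.
    return min(range(len(centroids)),
               key=lambda i: (point[0] - centroids[i][0]) ** 2 +
                             (point[1] - centroids[i][1]) ** 2)

groups = None
while True:
    new_groups = [closest(p, centroids) for p in points]
    if new_groups == groups:   # assignments are stable: converged
        break
    groups = new_groups
    # Move each centroid to the mean of its assigned points.
    for i in range(len(centroids)):
        members = [p for p, g in zip(points, groups) if g == i]
        centroids[i] = (sum(p[0] for p in members) / len(members),
                        sum(p[1] for p in members) / len(members))

print(groups)     # cluster index of each point, in input order
print(centroids)  # final centroids, matching step 2 of the output
```

Running it reproduces the step 2 grouping and centroids shown above.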
The blue cluster contains scaled features (0.09375,0.2), (0.25,0.65), (0.15625,0.48), (0.46875,1), (0.375,0.75), (0,0.7) or unscaled ones (23,50000), (28,95000), (25,78000), (35,130000), (32,105000), (20,100000). The red cluster contains scaled features (0.53125,0.04), (0.875,0.1), (1,0), (0.625,0.3), (0.9375,0.5) or unscaled ones (37,34000), (48,40000), (52,30000), (40,60000), (50,80000).
So Peter belongs to the red cluster. What is the proportion of house owners in the red cluster, not counting Peter? 2/4, or 1/2, of the people in the red cluster are house owners. Thus the red cluster to which Peter belongs does not seem to have high predictive power for determining whether Peter is a house owner. We may try to cluster the data into more clusters in the hope of obtaining a purer cluster that would be more reliable for predicting Peter's house ownership. Let us therefore try to cluster the data into three clusters.
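This proportion can be checked in a couple of lines; the ownership labels come from the original table, with Peter at (50, 80000) left out:

```python
# Owners' share of the red cluster, with Peter (50, 80000) excluded.
red_cluster = [((37, 34000), "non-owner"), ((48, 40000), "owner"),
               ((52, 30000), "non-owner"), ((40, 60000), "owner")]
owners = sum(1 for _, status in red_cluster if status == "owner")
proportion = owners / len(red_cluster)
print(proportion)  # 2/4 = 0.5
```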
Output for three clusters:
$ python k-means_clustering.py house_ownership2.csv 3 last
The total number of steps: 3
The history of the algorithm:
Step number 0: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 2), ((0.15625, 0.48), 0), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 0), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.09375, 0.2), (1.0, 0.0), (0.46875, 1.0)]
Step number 1: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 1), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 2), ((0.15625, 0.48), 0), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 2), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.1953125, 0.355), (0.859375, 0.225), (0.3645833333333333, 0.7999999999999999)]
Step number 2: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 1), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 2), ((0.15625, 0.48), 0), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 2), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.125, 0.33999999999999997), (0.79375, 0.188), (0.2734375, 0.7749999999999999)]
The red cluster has stayed the same. Let us therefore cluster the data into four clusters.
Output for four clusters:
$ python k-means_clustering.py house_ownership2.csv 4 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 3), ((0.15625, 0.48), 3), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 3), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.09375, 0.2), (1.0, 0.0), (0.46875, 1.0), (0.0, 0.7)]
Step number 1: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 3), ((0.15625, 0.48), 3), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 3), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.3125, 0.12000000000000001), (0.859375, 0.225), (0.421875, 0.875), (0.13541666666666666, 0.61)]
Now the red cluster, where Peter belongs, has changed. What is the proportion of house owners in the red cluster now? Not counting Peter, 2/3 of the people in the red cluster own a house. When we clustered into two or three clusters, the proportion was only 1/2, which told us nothing about whether Peter is a house owner. Now a majority of the red cluster, not counting Peter, are house owners, so we have a stronger belief that Peter is also a house owner. However, 2/3 is still a relatively low confidence for classifying Peter as a house owner. Let us partition the data into five clusters to see what happens.
Output for five clusters:
$ python k-means_clustering.py house_ownership2.csv 5 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 3), ((0.15625, 0.48), 3), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 3), ((0.625, 0.3), 4), ((0.9375, 0.5), 4)]
centroids = [(0.09375, 0.2), (1.0, 0.0), (0.46875, 1.0), (0.0, 0.7), (0.9375, 0.5)]
Step number 1: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 3), ((0.15625, 0.48), 3), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 3), ((0.625, 0.3), 4), ((0.9375, 0.5), 4)]
centroids = [(0.3125, 0.12000000000000001), (0.9375, 0.05), (0.421875, 0.875), (0.13541666666666666, 0.61), (0.78125, 0.4)]
Now the red cluster contains only Peter and one other person, (40, 60000), who is a house owner according to the table, so this clustering also points toward Peter being an owner. However, a cluster with only a single labeled member is very weak evidence, so it is still not clear whether Peter owns a house. Collecting more data would improve our analysis and should be carried out before making a definite classification in this problem.
From our analysis, we notice that a different number of clusters can lead to a different classification result, since the composition of an individual cluster can change. After collecting more data, we should perform cross-validation to determine the number of clusters that classifies the data with the highest accuracy.
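A rough sketch of what such a cross-validation could look like on the present data, using a simplified leave-one-out scheme: each labeled person is predicted from the majority ownership status of the other labeled people in their cluster, and accuracy is the fraction predicted correctly. The farthest-first centroid initialization below is an assumption made for determinism; the book's k-means_clustering.py may initialize differently:

```python
# Leave-one-out evaluation of cluster count k (a sketch; the
# farthest-first initialization is an assumption, not the book's code).
points = [(0.09375, 0.2), (0.53125, 0.04), (0.875, 0.1), (1.0, 0.0),
          (0.25, 0.65), (0.15625, 0.48), (0.46875, 1.0), (0.375, 0.75),
          (0.0, 0.7), (0.625, 0.3), (0.9375, 0.5)]
labels = ["non-owner", "non-owner", "owner", "non-owner", "owner",
          "non-owner", "owner", "owner", "non-owner", "owner",
          None]  # None = Peter, whose status is unknown

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k):
    # Farthest-first initialization: start with the first point, then
    # repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(dist2(p, c) for c in centroids)))
    groups = None
    while True:
        new = [min(range(k), key=lambda i: dist2(p, centroids[i]))
               for p in points]
        if new == groups:
            return groups
        groups = new
        for i in range(k):
            members = [p for p, g in zip(points, groups) if g == i]
            if members:
                centroids[i] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))

accuracies = {}
for k in range(2, 6):
    groups = kmeans(points, k)
    correct, total = 0, 0
    for i, label in enumerate(labels):
        if label is None:
            continue  # Peter is clustered but not scored
        # Majority vote among the *other* labeled members of i's cluster;
        # ties and empty votes default to "non-owner".
        votes = [labels[j] for j in range(len(points))
                 if j != i and groups[j] == groups[i] and labels[j] is not None]
        pred = "owner" if votes.count("owner") * 2 > len(votes) else "non-owner"
        correct += (pred == label)
        total += 1
    accuracies[k] = correct / total
print(accuracies)
```

With more data collected, the k that achieves the highest accuracy under such a scheme would be the most trustworthy choice for classifying Peter.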