House ownership – choosing the number of clusters

Let us take the house ownership example from the first chapter.

Age | Annual income in USD | House ownership status
23  | 50000                | non-owner
37  | 34000                | non-owner
48  | 40000                | owner
52  | 30000                | non-owner
28  | 95000                | owner
25  | 78000                | non-owner
35  | 130000               | owner
32  | 105000               | owner
20  | 100000               | non-owner
40  | 60000                | owner
50  | 80000                | ? (Peter)

We would like to use clustering to predict whether Peter is a house owner.

Analysis:

Just as in the first chapter, we have to scale the data, since the income values are orders of magnitude larger than the age values and would otherwise drown out the age axis, which actually has good predictive power for this kind of problem. This is because older people are expected to have had more time to settle down, save money, and buy a house than younger ones.

We apply the same rescaling as in Chapter 1 and get the following table (a short code sketch of this rescaling follows the table):

Age | Scaled age | Annual income in USD | Scaled annual income | House ownership status
23  | 0.09375    | 50000                | 0.2                  | non-owner
37  | 0.53125    | 34000                | 0.04                 | non-owner
48  | 0.875      | 40000                | 0.1                  | owner
52  | 1          | 30000                | 0                    | non-owner
28  | 0.25       | 95000                | 0.65                 | owner
25  | 0.15625    | 78000                | 0.48                 | non-owner
35  | 0.46875    | 130000               | 1                    | owner
32  | 0.375      | 105000               | 0.75                 | owner
20  | 0          | 100000               | 0.7                  | non-owner
40  | 0.625      | 60000                | 0.3                  | owner
50  | 0.9375     | 80000                | 0.5                  | ? (Peter)
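
The rescaling above is ordinary min-max normalization: every value x in a column is mapped to (x - min) / (max - min), so that the smallest value in the column becomes 0 and the largest becomes 1. A minimal Python sketch of this computation follows; the function and variable names are ours, not part of the book's source code, and the column ranges (age 20 to 52, income 30,000 to 130,000 USD) are read off the table above.

# Min-max rescaling of the house ownership data onto the [0, 1] range.
def rescale(value, minimum, maximum):
    # Map value from the interval [minimum, maximum] linearly onto [0, 1].
    return (value - minimum) / (maximum - minimum)

# (age, annual income in USD); the last pair is Peter.
data = [(23, 50000), (37, 34000), (48, 40000), (52, 30000), (28, 95000),
        (25, 78000), (35, 130000), (32, 105000), (20, 100000), (40, 60000),
        (50, 80000)]

ages = [age for age, income in data]
incomes = [income for age, income in data]

for age, income in data:
    print(rescale(age, min(ages), max(ages)),            # (age - 20) / 32
          rescale(income, min(incomes), max(incomes)))   # (income - 30000) / 100000

The printed pairs match the scaled values in the table and in the house_ownership2.csv input file below.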

Given the table, we produce the input file for the algorithm and execute it, clustering the features into two clusters.
Input:

# source_code/5/house_ownership2.csv
0.09375,0.2
0.53125,0.04
0.875,0.1
1,0
0.25,0.65
0.15625,0.48
0.46875,1
0.375,0.75
0,0.7
0.625,0.3
0.9375,0.5
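
The script invoked below, k-means_clustering.py, comes with the book's source code. The procedure it carries out is standard k-means clustering (Lloyd's algorithm): pick k initial centroids, assign every point to its nearest centroid, recompute each centroid as the mean of the points assigned to it, and repeat until the assignment no longer changes. The following is only a minimal sketch of that procedure, not the book's script; in particular, the choice of initial centroids here (the first k points) is our assumption and may differ from the book's implementation, so the intermediate steps need not match the history printed below.

# A minimal k-means (Lloyd's algorithm) sketch for 2-dimensional points.
import csv
import sys

def distance_squared(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean(points):
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def k_means(points, k):
    centroids = list(points[:k])  # naive initialization: the first k points
    assignment = None
    while True:
        # Assign every point to the cluster of its nearest centroid.
        new_assignment = [min(range(k), key=lambda i: distance_squared(point, centroids[i]))
                          for point in points]
        if new_assignment == assignment:  # stop when nothing changes
            return list(zip(points, assignment)), centroids
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for i in range(k):
            members = [p for p, group in zip(points, assignment) if group == i]
            if members:
                centroids[i] = mean(members)

if __name__ == '__main__':
    file_name, k = sys.argv[1], int(sys.argv[2])
    with open(file_name) as csv_file:
        points = [(float(row[0]), float(row[1])) for row in csv.reader(csv_file) if row]
    point_groups, centroids = k_means(points, k)
    print('point_groups =', point_groups)
    print('centroids =', centroids)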

Output for two clusters:

$ python k-means_clustering.py house_ownership2.csv 2 last
The total number of steps: 3
The history of the algorithm:
Step number 0: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 0), ((0.15625, 0.48), 0), ((0.46875, 1.0), 0), ((0.375, 0.75), 0), ((0.0, 0.7), 0), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.09375, 0.2), (1.0, 0.0)]
Step number 1: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 1), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 0), ((0.15625, 0.48), 0), ((0.46875, 1.0), 0), ((0.375, 0.75), 0), ((0.0, 0.7), 0), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.26785714285714285, 0.5457142857142857), (0.859375, 0.225)]
Step number 2: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 1), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 0), ((0.15625, 0.48), 0), ((0.46875, 1.0), 0), ((0.375, 0.75), 0), ((0.0, 0.7), 0), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.22395833333333334, 0.63), (0.79375, 0.188)]

The blue cluster (cluster 0 in the output) contains the scaled points (0.09375, 0.2), (0.25, 0.65), (0.15625, 0.48), (0.46875, 1), (0.375, 0.75), (0, 0.7), that is, the unscaled points (23, 50000), (28, 95000), (25, 78000), (35, 130000), (32, 105000), (20, 100000). The red cluster (cluster 1) contains the scaled points (0.53125, 0.04), (0.875, 0.1), (1, 0), (0.625, 0.3), (0.9375, 0.5), that is, the unscaled points (37, 34000), (48, 40000), (52, 30000), (40, 60000), (50, 80000).

So Peter belongs to the red cluster. What proportion of the people in the red cluster, not counting Peter, are house owners? 2/4, or 1/2, of them own a house. Thus the red cluster, to which Peter belongs, does not seem to have high predictive power for determining whether Peter is a house owner. We may try to cluster the data into more clusters in the hope of obtaining a purer cluster that would be more reliable for predicting house ownership for Peter. Let us therefore cluster the data into three clusters.
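
The owner proportion in Peter's cluster can be computed directly from the assignment above. A small sketch, assuming the labels from the table and the final cluster indices from the two-cluster output (Peter, the eleventh point, is left out; his cluster index is 1):

# Proportion of house owners in the cluster Peter was assigned to
# (final assignment of the two-cluster run above, Peter excluded).
labels = ['non-owner', 'non-owner', 'owner', 'non-owner', 'owner',
          'non-owner', 'owner', 'owner', 'non-owner', 'owner']
clusters = [0, 1, 1, 1, 0, 0, 0, 0, 0, 1]  # cluster index of each labeled point
peter_cluster = 1  # the red cluster

members = [label for label, cluster in zip(labels, clusters) if cluster == peter_cluster]
print(members.count('owner'), '/', len(members))  # prints: 2 / 4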

Output for three clusters:

$ python k-means_clustering.py house_ownership2.csv 3 last
The total number of steps: 3
The history of the algorithm:
Step number 0: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 2), ((0.15625, 0.48), 0), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 0), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.09375, 0.2), (1.0, 0.0), (0.46875, 1.0)]
Step number 1: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 1), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 2), ((0.15625, 0.48), 0), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 2), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.1953125, 0.355), (0.859375, 0.225), (0.3645833333333333, 0.7999999999999999)]
Step number 2: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 1), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 2), ((0.15625, 0.48), 0), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 2), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.125, 0.33999999999999997), (0.79375, 0.188), (0.2734375, 0.7749999999999999)]

The red cluster, the one Peter belongs to, has stayed the same, so the proportion of house owners in it is still 1/2. Let us therefore cluster the data into four clusters.

Output for four clusters:

$ python k-means_clustering.py house_ownership2.csv 4 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 3), ((0.15625, 0.48), 3), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 3), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.09375, 0.2), (1.0, 0.0), (0.46875, 1.0), (0.0, 0.7)]
Step number 1: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 3), ((0.15625, 0.48), 3), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 3), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.3125, 0.12000000000000001), (0.859375, 0.225), (0.421875, 0.875), (0.13541666666666666, 0.61)]

Now the red cluster where Peter belongs has changed. What is the proportion of house owners in the red cluster now? If we do not count Peter, 2/3 of the people in the red cluster own a house. When we clustered into two or three clusters, the proportion was only 1/2, which told us little about whether Peter is a house owner. Now there is a majority of house owners in the red cluster (not counting Peter), so we have a stronger belief that Peter is also a house owner. However, 2/3 is still a relatively low confidence for classifying Peter as a house owner. Let us partition the data into five clusters to see what happens.

Output for five clusters:

$ python k-means_clustering.py house_ownership2.csv 5 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 3), ((0.15625, 0.48), 3), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 3), ((0.625, 0.3), 4), ((0.9375, 0.5), 4)]
centroids = [(0.09375, 0.2), (1.0, 0.0), (0.46875, 1.0), (0.0, 0.7), (0.9375, 0.5)]
Step number 1: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 3), ((0.15625, 0.48), 3), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 3), ((0.625, 0.3), 4), ((0.9375, 0.5), 4)]
centroids = [(0.3125, 0.12000000000000001), (0.9375, 0.05), (0.421875, 0.875), (0.13541666666666666, 0.61), (0.78125, 0.4)]

Now the red cluster contains only Peter and one other person, a house owner (the 40-year-old earning 60,000 USD). This clustering again suggests that Peter is more likely a house owner, but a cluster with a single labeled member is too small to carry much weight on its own. Taken together with the earlier runs, where the owner proportion in Peter's cluster was only 1/2 or 2/3, it is still not entirely clear whether Peter owns a house. Collecting more data would improve our analysis and should be carried out before making a definite classification in this problem.

From our analysis we noticed that a different number of clusters can lead to a different classification result, since the composition of an individual cluster changes with the number of clusters. After collecting more data, we should perform cross-validation to determine the number of clusters that classifies the data with the highest accuracy, as sketched below.
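
One way to carry out such a validation, once more data is available, is a leave-one-out check over the candidate numbers of clusters: cluster all of the points, predict each labeled point from the majority label of the other members of its cluster, and keep the number of clusters with the highest accuracy. Below is a sketch of this idea using scikit-learn's KMeans on the present (small) data set; the loop bounds and variable names are our own choices, and with this little data the result should not be taken as definitive.

# Choosing the number of clusters by a leave-one-out majority-vote check.
from collections import Counter
from sklearn.cluster import KMeans

points = [[0.09375, 0.2], [0.53125, 0.04], [0.875, 0.1], [1.0, 0.0],
          [0.25, 0.65], [0.15625, 0.48], [0.46875, 1.0], [0.375, 0.75],
          [0.0, 0.7], [0.625, 0.3]]  # the ten labeled points (Peter excluded)
labels = ['non-owner', 'non-owner', 'owner', 'non-owner', 'owner',
          'non-owner', 'owner', 'owner', 'non-owner', 'owner']

for k in range(2, 6):
    groups = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    correct = 0
    for i, group in enumerate(groups):
        # Predict point i from the majority label of the other points in its cluster.
        others = [labels[j] for j, g in enumerate(groups) if g == group and j != i]
        if others and Counter(others).most_common(1)[0][0] == labels[i]:
            correct += 1
    print(k, 'clusters:', correct, 'out of', len(points), 'predicted correctly')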
