Let us take the house-ownership example from the first chapter.
Age | Annual income in USD | House ownership status |
23 | 50000 | non-owner |
37 | 34000 | non-owner |
48 | 40000 | owner |
52 | 30000 | non-owner |
28 | 95000 | owner |
25 | 78000 | non-owner |
35 | 130000 | owner |
32 | 105000 | owner |
20 | 100000 | non-owner |
40 | 60000 | owner |
50 | 80000 | ? (Peter)
We would like to use clustering to predict whether Peter, the last person in the table, is a house owner.
Analysis:
Just as in the first chapter, we have to scale the data: the income axis is greater by orders of magnitude and would otherwise diminish the impact of the age axis, which actually has good predictive power in this kind of problem. This is because older people have had more time to settle down, save money, and buy a house than younger ones.
We apply the same rescaling as in Chapter 1 and get the following table:
Age | Scaled age | Annual income in USD | Scaled annual income | House ownership status |
23 | 0.09375 | 50000 | 0.2 | non-owner |
37 | 0.53125 | 34000 | 0.04 | non-owner |
48 | 0.875 | 40000 | 0.1 | owner |
52 | 1 | 30000 | 0 | non-owner |
28 | 0.25 | 95000 | 0.65 | owner |
25 | 0.15625 | 78000 | 0.48 | non-owner |
35 | 0.46875 | 130000 | 1 | owner |
32 | 0.375 | 105000 | 0.75 | owner |
20 | 0 | 100000 | 0.7 | non-owner |
40 | 0.625 | 60000 | 0.3 | owner |
50 | 0.9375 | 80000 | 0.5 | ? |
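The scaled columns can be reproduced with min-max rescaling; a minimal sketch (note that Peter's row is included when computing the minima and maxima):

```python
# Min-max rescaling: each value v is mapped to (v - min) / (max - min),
# so the smallest value in a column becomes 0 and the largest becomes 1.
ages = [23, 37, 48, 52, 28, 25, 35, 32, 20, 40, 50]
incomes = [50000, 34000, 40000, 30000, 95000, 78000,
           130000, 105000, 100000, 60000, 80000]

def rescale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled_ages = rescale(ages)        # [0.09375, 0.53125, ..., 0.9375]
scaled_incomes = rescale(incomes)  # [0.2, 0.04, ..., 0.5]
```

For example, Peter's age 50 maps to (50 - 20) / (52 - 20) = 0.9375, matching the table.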
Given the table, we produce the input file for the algorithm and execute it, clustering the data into two clusters.
Input:
# source_code/5/house_ownership2.csv
0.09375,0.2
0.53125,0.04
0.875,0.1
1,0
0.25,0.65
0.15625,0.48
0.46875,1
0.375,0.75
0,0.7
0.625,0.3
0.9375,0.5
Output for two clusters:
$ python k-means_clustering.py house_ownership2.csv 2 last
The total number of steps: 3
The history of the algorithm:
Step number 0: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 0), ((0.15625, 0.48), 0), ((0.46875, 1.0), 0), ((0.375, 0.75), 0), ((0.0, 0.7), 0), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.09375, 0.2), (1.0, 0.0)]
Step number 1: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 1), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 0), ((0.15625, 0.48), 0), ((0.46875, 1.0), 0), ((0.375, 0.75), 0), ((0.0, 0.7), 0), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.26785714285714285, 0.5457142857142857), (0.859375, 0.225)]
Step number 2: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 1), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 0), ((0.15625, 0.48), 0), ((0.46875, 1.0), 0), ((0.375, 0.75), 0), ((0.0, 0.7), 0), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.22395833333333334, 0.63), (0.79375, 0.188)]
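This run can be reproduced with a minimal sketch of the k-means (Lloyd's) iteration. The initial centroids below are simply taken from the step 0 output; the book's k-means_clustering.py chooses its own starting centroids:

```python
# Minimal k-means: repeat (assign each point to its nearest centroid,
# move each centroid to the mean of its points) until assignments stop
# changing. Initial centroids are taken from the step 0 output above.
points = [(0.09375, 0.2), (0.53125, 0.04), (0.875, 0.1), (1.0, 0.0),
          (0.25, 0.65), (0.15625, 0.48), (0.46875, 1.0), (0.375, 0.75),
          (0.0, 0.7), (0.625, 0.3), (0.9375, 0.5)]
centroids = [(0.09375, 0.2), (1.0, 0.0)]

def closest(point, centroids):
    # Index of the nearest centroid by squared Euclidean distance.
    return min(range(len(centroids)),
               key=lambda i: (point[0] - centroids[i][0]) ** 2 +
                             (point[1] - centroids[i][1]) ** 2)

groups = None
while True:
    new_groups = [closest(p, centroids) for p in points]
    if new_groups == groups:   # assignments are stable: converged
        break
    groups = new_groups
    # Move each centroid to the mean of its assigned points.
    for i in range(len(centroids)):
        members = [p for p, g in zip(points, groups) if g == i]
        centroids[i] = (sum(p[0] for p in members) / len(members),
                        sum(p[1] for p in members) / len(members))

print(groups)     # cluster index of each point, in input order
print(centroids)  # final centroids, matching step 2 of the output
```

Running it reproduces the step 2 grouping and centroids shown above.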
The blue cluster contains scaled features (0.09375,0.2), (0.25,0.65), (0.15625,0.48), (0.46875,1), (0.375,0.75), (0,0.7) or unscaled ones (23,50000), (28,95000), (25,78000), (35,130000), (32,105000), (20,100000). The red cluster contains scaled features (0.53125,0.04), (0.875,0.1), (1,0), (0.625,0.3), (0.9375,0.5) or unscaled ones (37,34000), (48,40000), (52,30000), (40,60000), (50,80000).
So Peter belongs to the red cluster. What is the proportion of house owners in the red cluster, not counting Peter? 2/4, or 1/2, of the people in the red cluster are house owners. Thus the red cluster to which Peter belongs does not seem to have high predictive power for determining whether Peter is a house owner. We may try to cluster the data into more clusters in the hope of obtaining a purer cluster that would be more reliable for predicting Peter's house ownership. Let us therefore try to cluster the data into three clusters.
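This proportion can be checked in a couple of lines; the ownership labels come from the original table, with Peter at (50, 80000) left out:

```python
# Owners' share of the red cluster, with Peter (50, 80000) excluded.
red_cluster = [((37, 34000), "non-owner"), ((48, 40000), "owner"),
               ((52, 30000), "non-owner"), ((40, 60000), "owner")]
owners = sum(1 for _, status in red_cluster if status == "owner")
proportion = owners / len(red_cluster)
print(proportion)  # 2/4 = 0.5
```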
Output for three clusters:
$ python k-means_clustering.py house_ownership2.csv 3 last
The total number of steps: 3
The history of the algorithm:
Step number 0: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 2), ((0.15625, 0.48), 0), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 0), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.09375, 0.2), (1.0, 0.0), (0.46875, 1.0)]
Step number 1: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 1), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 2), ((0.15625, 0.48), 0), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 2), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.1953125, 0.355), (0.859375, 0.225), (0.3645833333333333, 0.7999999999999999)]
Step number 2: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 1), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 2), ((0.15625, 0.48), 0), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 2), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.125, 0.33999999999999997), (0.79375, 0.188), (0.2734375, 0.7749999999999999)]
The red cluster has stayed the same. Let us therefore cluster the data into four clusters.
Output for four clusters:
$ python k-means_clustering.py house_ownership2.csv 4 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 3), ((0.15625, 0.48), 3), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 3), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.09375, 0.2), (1.0, 0.0), (0.46875, 1.0), (0.0, 0.7)]
Step number 1: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 3), ((0.15625, 0.48), 3), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 3), ((0.625, 0.3), 1), ((0.9375, 0.5), 1)]
centroids = [(0.3125, 0.12000000000000001), (0.859375, 0.225), (0.421875, 0.875), (0.13541666666666666, 0.61)]
Now the red cluster, where Peter belongs, has changed. What is the proportion of house owners in the red cluster now? Not counting Peter, 2/3 of the people in the red cluster own a house. When we clustered into two or three clusters, the proportion was only 1/2, which told us nothing about whether Peter is a house owner. Now a majority of the red cluster, not counting Peter, are house owners, so we have a stronger belief that Peter is also a house owner. However, 2/3 is still a relatively low confidence for classifying Peter as a house owner. Let us partition the data into five clusters to see what happens.
Output for five clusters:
$ python k-means_clustering.py house_ownership2.csv 5 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 3), ((0.15625, 0.48), 3), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 3), ((0.625, 0.3), 4), ((0.9375, 0.5), 4)]
centroids = [(0.09375, 0.2), (1.0, 0.0), (0.46875, 1.0), (0.0, 0.7), (0.9375, 0.5)]
Step number 1: point_groups = [((0.09375, 0.2), 0), ((0.53125, 0.04), 0), ((0.875, 0.1), 1), ((1.0, 0.0), 1), ((0.25, 0.65), 3), ((0.15625, 0.48), 3), ((0.46875, 1.0), 2), ((0.375, 0.75), 2), ((0.0, 0.7), 3), ((0.625, 0.3), 4), ((0.9375, 0.5), 4)]
centroids = [(0.3125, 0.12000000000000001), (0.9375, 0.05), (0.421875, 0.875), (0.13541666666666666, 0.61), (0.78125, 0.4)]
Now the red cluster contains only Peter and one other person, (40, 60000), who is a house owner according to the table, so this clustering also points toward Peter being an owner. However, a cluster with only a single labeled member is very weak evidence, so it is still not clear whether Peter owns a house. Collecting more data would improve our analysis and should be carried out before making a definite classification in this problem.
From our analysis, we notice that a different number of clusters can lead to a different classification result, since the composition of an individual cluster can change. After collecting more data, we should perform cross-validation to determine the number of clusters that classifies the data with the highest accuracy.
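A rough sketch of what such a cross-validation could look like on the present data, using a simplified leave-one-out scheme: each labeled person is predicted from the majority ownership status of the other labeled people in their cluster, and accuracy is the fraction predicted correctly. The farthest-first centroid initialization below is an assumption made for determinism; the book's k-means_clustering.py may initialize differently:

```python
# Leave-one-out evaluation of cluster count k (a sketch; the
# farthest-first initialization is an assumption, not the book's code).
points = [(0.09375, 0.2), (0.53125, 0.04), (0.875, 0.1), (1.0, 0.0),
          (0.25, 0.65), (0.15625, 0.48), (0.46875, 1.0), (0.375, 0.75),
          (0.0, 0.7), (0.625, 0.3), (0.9375, 0.5)]
labels = ["non-owner", "non-owner", "owner", "non-owner", "owner",
          "non-owner", "owner", "owner", "non-owner", "owner",
          None]  # None = Peter, whose status is unknown

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k):
    # Farthest-first initialization: start with the first point, then
    # repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(dist2(p, c) for c in centroids)))
    groups = None
    while True:
        new = [min(range(k), key=lambda i: dist2(p, centroids[i]))
               for p in points]
        if new == groups:
            return groups
        groups = new
        for i in range(k):
            members = [p for p, g in zip(points, groups) if g == i]
            if members:
                centroids[i] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))

accuracies = {}
for k in range(2, 6):
    groups = kmeans(points, k)
    correct, total = 0, 0
    for i, label in enumerate(labels):
        if label is None:
            continue  # Peter is clustered but not scored
        # Majority vote among the *other* labeled members of i's cluster;
        # ties and empty votes default to "non-owner".
        votes = [labels[j] for j in range(len(points))
                 if j != i and groups[j] == groups[i] and labels[j] is not None]
        pred = "owner" if votes.count("owner") * 2 > len(votes) else "non-owner"
        correct += (pred == label)
        total += 1
    accuracies[k] = correct / total
print(accuracies)
```

With more data collected, the k that achieves the highest accuracy under such a scheme would be the most trustworthy choice for classifying Peter.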