Problems

  1. Compute the centroid of the following clusters:

a) 2, 3, 4
b) 100$, 400$, 1000$
c) (10,20), (40, 60), (0, 40)
d) (200$, 40km), (300$, 60km), (500$, 100km), (250$, 200km)
e) (1,2,4), (0,0,3), (10,20,5), (4,8,2), (5,0,1)

  1. Cluster the following datasets into the 2, 3 and 4 clusters using k-means clustering algorithm:

a) 0, 2, 5, 4, 8, 10, 12, 11.
b) (2,2), (2,5), (10,4), (3,5), (7,3), (5,9), (2,8), (4,10), (7,4), (4,4), (5,8), (9,3).

 

  1. (Couples and the number of their children) We are given the following ages of the couples and their number of the children.
Couple number
Wife age
Husband age
Number of children
1
48
49
5
2
40
43
2
3
24
28
1
4
49
42
3
5
32
34
0
6
24
27
0
7
29
32
2
8
35
35
2
9
33
36
1
10
42
47
3
11
22
27
2
12
41
45
4
13
39
43
4
14
36
38
2
15
30
32
1
16
36
38
0
17
36
39
3
18
37
38
?

We would like to guess using clustering how many children a couple has where the age of the husband is 37 and the age of the wife is 38.

Analysis:

  1.     a) (1/3)*(2+3+4)=3

b) (1/3)*(100$+400$+1000$)=500$
c) ((10+40+0)/3,(20+60+40)/3)=(50/3, 120/3)=(50/3, 40)
d) ((200$+300$+500$+250$)/4,(40km+60km+100km+200km)/4)
e) =(1250$/4,400km/4)=(312.5$,100km)
f) ((1+0+10+4+5)/5,(2+0+20+8+0)/5,(4+3+5+2+1)/5)=(4,6,3)

  1. We add a second coordinate and set it to 0 for all the features. This way the distance between the features does not change and we can use the clustering algorithm we implemented earlier in this chapter.

Input:

# source_code/5/problem5_2.csv
0,0
2,0
5,0
4,0
8,0
10,0
12,0
11,0

For 2 clusters:

$ python k-means_clustering.py problem5_2.csv 2 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((0.0, 0.0), 0), ((2.0, 0.0), 0), ((5.0, 0.0), 0), ((4.0, 0.0), 0), ((8.0, 0.0), 1), ((10.0, 0.0), 1), ((12.0, 0.0), 1), ((11.0, 0.0), 1)]
centroids = [(0.0, 0.0), (12.0, 0.0)]
Step number 1: point_groups = [((0.0, 0.0), 0), ((2.0, 0.0), 0), ((5.0, 0.0), 0), ((4.0, 0.0), 0), ((8.0, 0.0), 1), ((10.0, 0.0), 1), ((12.0, 0.0), 1), ((11.0, 0.0), 1)]
centroids = [(2.75, 0.0), (10.25, 0.0)]

For 3 clusters:

$ python k-means_clustering.py problem5_2.csv 3 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((0.0, 0.0), 0), ((2.0, 0.0), 0), ((5.0, 0.0), 2), ((4.0, 0.0), 2), ((8.0, 0.0), 2), ((10.0, 0.0), 1), ((12.0, 0.0), 1), ((11.0, 0.0), 1)]
centroids = [(0.0, 0.0), (12.0, 0.0), (5.0, 0.0)]
Step number 1: point_groups = [((0.0, 0.0), 0), ((2.0, 0.0), 0), ((5.0, 0.0), 2), ((4.0, 0.0), 2), ((8.0, 0.0), 2), ((10.0, 0.0), 1), ((12.0, 0.0), 1), ((11.0, 0.0), 1)]
centroids = [(1.0, 0.0), (11.0, 0.0), (5.666666666666667, 0.0)]

For 4 clusters:

$ python k-means_clustering.py problem5_2.csv 4 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((0.0, 0.0), 0), ((2.0, 0.0), 0), ((5.0, 0.0), 2), ((4.0, 0.0), 2), ((8.0, 0.0), 3), ((10.0, 0.0), 1), ((12.0, 0.0), 1), ((11.0, 0.0), 1)]
centroids = [(0.0, 0.0), (12.0, 0.0), (5.0, 0.0), (8.0, 0.0)]
Step number 1: point_groups = [((0.0, 0.0), 0), ((2.0, 0.0), 0), ((5.0, 0.0), 2), ((4.0, 0.0), 2), ((8.0, 0.0), 3), ((10.0, 0.0), 1), ((12.0, 0.0), 1), ((11.0, 0.0), 1)]
centroids = [(1.0, 0.0), (11.0, 0.0), (4.5, 0.0), (8.0, 0.0)]

b) We use the implemented algorithm again.

Input:

# source_code/5/problem5_2b.csv
2,2
2,5
10,4
3,5
7,3
5,9
2,8
4,10
7,4
4,4
5,8
9,3

Output for 2 clusters:

$ python k-means_clustering.py problem5_2b.csv 2 last
The total number of steps: 3
The history of the algorithm:
Step number 0: point_groups = [((2.0, 2.0), 0), ((2.0, 5.0), 0), ((10.0, 4.0), 1), ((3.0, 5.0), 0), ((7.0, 3.0), 1), ((5.0, 9.0), 1), ((2.0, 8.0), 0), ((4.0, 10.0), 0), ((7.0, 4.0), 1), ((4.0, 4.0), 0), ((5.0, 8.0), 1), ((9.0, 3.0), 1)]
centroids = [(2.0, 2.0), (10.0, 4.0)]
Step number 1: point_groups = [((2.0, 2.0), 0), ((2.0, 5.0), 0), ((10.0, 4.0), 1), ((3.0, 5.0), 0), ((7.0, 3.0), 1), ((5.0, 9.0), 0), ((2.0, 8.0), 0), ((4.0, 10.0), 0), ((7.0, 4.0), 1), ((4.0, 4.0), 0), ((5.0, 8.0), 0), ((9.0, 3.0), 1)]
centroids = [(2.8333333333333335, 5.666666666666667), (7.166666666666667, 5.166666666666667)]
Step number 2: point_groups = [((2.0, 2.0), 0), ((2.0, 5.0), 0), ((10.0, 4.0), 1), ((3.0, 5.0), 0), ((7.0, 3.0), 1), ((5.0, 9.0), 0), ((2.0, 8.0), 0), ((4.0, 10.0), 0), ((7.0, 4.0), 1), ((4.0, 4.0), 0), ((5.0, 8.0), 0), ((9.0, 3.0), 1)]
centroids = [(3.375, 6.375), (8.25, 3.5)]

Output for 3 clusters:

$ python k-means_clustering.py problem5_2b.csv 3 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((2.0, 2.0), 0), ((2.0, 5.0), 0), ((10.0, 4.0), 1), ((3.0, 5.0), 0), ((7.0, 3.0), 1), ((5.0, 9.0), 2), ((2.0, 8.0), 2), ((4.0, 10.0), 2), ((7.0, 4.0), 1), ((4.0, 4.0), 0), ((5.0, 8.0), 2), ((9.0, 3.0), 1)]
centroids = [(2.0, 2.0), (10.0, 4.0), (4.0, 10.0)]
Step number 1: point_groups = [((2.0, 2.0), 0), ((2.0, 5.0), 0), ((10.0, 4.0), 1), ((3.0, 5.0), 0), ((7.0, 3.0), 1), ((5.0, 9.0), 2), ((2.0, 8.0), 2), ((4.0, 10.0), 2), ((7.0, 4.0), 1), ((4.0, 4.0), 0), ((5.0, 8.0), 2), ((9.0, 3.0), 1)]
centroids = [(2.75, 4.0), (8.25, 3.5), (4.0, 8.75)]

Output for 4 clusters:

$ python k-means_clustering.py problem5_2b.csv 4 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((2.0, 2.0), 0), ((2.0, 5.0), 3), ((10.0, 4.0), 1), ((3.0, 5.0), 3), ((7.0, 3.0), 1), ((5.0, 9.0), 2), ((2.0, 8.0), 2), ((4.0, 10.0), 2), ((7.0, 4.0), 1), ((4.0, 4.0), 3), ((5.0, 8.0), 2), ((9.0, 3.0), 1)]
centroids = [(2.0, 2.0), (10.0, 4.0), (4.0, 10.0), (3.0, 5.0)]
Step number 1: point_groups = [((2.0, 2.0), 0), ((2.0, 5.0), 3), ((10.0, 4.0), 1), ((3.0, 5.0), 3), ((7.0, 3.0), 1), ((5.0, 9.0), 2), ((2.0, 8.0), 2), ((4.0, 10.0), 2), ((7.0, 4.0), 1), ((4.0, 4.0), 3), ((5.0, 8.0), 2), ((9.0, 3.0), 1)]
centroids = [(2.0, 2.0), (8.25, 3.5), (4.0, 8.75), (3.0, 4.666666666666667)]
  1. We are given 17 couples and their number of children and would like to find out how many children has the 18th couple. We will use the first 14 couples as data and then the next 3 couples for the cross-validation to determine the number of clusters k that we will use to find out how many children the 18th couple is expected to have.

After clustering we will say that a couple is likely to have about the number of the children that is the average of the children in that cluster. Using the cross-validation we will choose the number of the clusters that will minimize the difference between the actual number of the children and the predicted number of the children. We will capture this difference for all the items in the cluster cumulatively as the square root of the squares of the differences of children for each couple. This will minimize the variance of the random variable for the predicted number of the children for the 18th couple.

We will perform the clustering into 2,3,4 and 5 clusters.

Input:

# source_code/5/couples_children.csv
48,49
40,43
24,28
49,42
32,34
24,27
29,32
35,35
33,36
42,47
22,27
41,45
39,43
36,38
30,32
36,38
36,39
37,38

Output for 2 clusters:

A couple listed for a cluster is of the form (couple_number,(wife_age,husband_age)).

Cluster 0: [(1, (48.0, 49.0)), (2, (40.0, 43.0)), (4, (49.0, 42.0)), (10, (42.0, 47.0)), (12, (41.0, 45.0)), (13, (39.0, 43.0)), (14, (36.0, 38.0)), (16, (36.0, 38.0)), (17, (36.0, 39.0)), (18, (37.0, 38.0))]
Cluster 1: [(3, (24.0, 28.0)), (5, (32.0, 34.0)), (6, (24.0, 27.0)), (7, (29.0, 32.0)), (8, (35.0, 35.0)), (9, (33.0, 36.0)), (11, (22.0, 27.0)), (15, (30.0, 32.0))]

We would like to determine the expected number of the children for the 15th couple (30,32), i.e where a wife is 30 years old and the husband is 32 years old. (30,32) is in the cluster 1. The couples in the cluster 1 are: (24.0, 28.0), (32.0, 34.0), (24.0, 27.0), (29.0, 32.0), (35.0, 35.0), (33.0, 36.0), (22.0, 27.0), (30.0, 32.0). Out of these and the first 14 couples used for the data the remaining couples are: (24.0, 28.0), (32.0, 34.0), (24.0, 27.0), (29.0, 32.0), (35.0, 35.0), (33.0, 36.0), (22.0, 27.0). The average number of the children for these couples is est15=8/7~1.14. This is the estimated number of the children for the 15th couple based on the data from the first 14 couples.

The estimated number of the children for the 16th couple is est16=23/7~3.29. The estimated number of the children for the 17th couple is also est17=23/7~3.29 since both 16th and 17th couple belong to the same cluster.

Now we will calculate the error E2 (2 for 2 clusters) between the estimated number of the children (e.g. denoted est15 for the 15th couple) and the actual number of the children (example. denoted act15 for the 15th couple ) as follows:

E2=sqrt(sqr(est15-act15)+sqr(est16-act16)+sqr(est17-act17))

=sqrt(sqr(8/7-1)+sqr(23/7-0)+sqr(23/7-3))~3.3

Now that we have calculated the error E2, we will calculate the errors of the estimation with the other number of clusters. We will choose the number of the clusters with the least error to estimate the number of the children for the 18th couple.

Output for 3 clusters:

Cluster 0: [(1, (48.0, 49.0)), (2, (40.0, 43.0)), (4, (49.0, 42.0)), (10, (42.0, 47.0)), (12, (41.0, 45.0)), (13, (39.0, 43.0))]
Cluster 1: [(3, (24.0, 28.0)), (6, (24.0, 27.0)), (7, (29.0, 32.0)), (11, (22.0, 27.0)), (15, (30.0, 32.0))]
Cluster 2: [(5, (32.0, 34.0)), (8, (35.0, 35.0)), (9, (33.0, 36.0)), (14, (36.0, 38.0)), (16, (36.0, 38.0)), (17, (36.0, 39.0)), (18, (37.0, 38.0))]

Now the 15th couple is in the cluster 1, 16th couple in the cluster 2, 17th couple in the cluster 2. So the estimated number of the children for each couple is 5/4=1.25.

The error E3 of the estimation is:

E3=sqrt((1.25-1)2+(1.25-0)2+(1.25-3)2)~2.17

Output for 4 clusters:

Cluster 0: [(1, (48.0, 49.0)), (4, (49.0, 42.0)), (10, (42.0, 47.0)), (12, (41.0, 45.0))]
Cluster 1: [(3, (24.0, 28.0)), (6, (24.0, 27.0)), (11, (22.0, 27.0))]
Cluster 2: [(2, (40.0, 43.0)), (13, (39.0, 43.0)), (14, (36.0, 38.0)), (16, (36.0, 38.0)), (17, (36.0, 39.0)), (18, (37.0, 38.0))]
Cluster 3: [(5, (32.0, 34.0)), (7, (29.0, 32.0)), (8, (35.0, 35.0)), (9, (33.0, 36.0)), (15, (30.0, 32.0))]

The 15th couple is in the cluster 3, 16th in the cluster 2, 17th in the cluster 2. So the estimated number of the children for the 15th couple is 5/4=1.25. The estimated number of the children for the 16th and 17th couple is 8/3~2.67 children.

The error E4 of the estimation is:

E4=sqrt((1.25-1)2+(8/3-0)2+(8/3-3)2)~2.70

Output for 5 clusters:

Cluster 0: [(1, (48.0, 49.0)), (4, (49.0, 42.0))]
Cluster 1: [(3, (24.0, 28.0)), (6, (24.0, 27.0)), (11, (22.0, 27.0))]
Cluster 2: [(8, (35.0, 35.0)), (9, (33.0, 36.0)), (14, (36.0, 38.0)), (16, (36.0, 38.0)), (17, (36.0, 39.0)), (18, (37.0, 38.0))]
Cluster 3: [(5, (32.0, 34.0)), (7, (29.0, 32.0)), (15, (30.0, 32.0))]
Cluster 4: [(2, (40.0, 43.0)), (10, (42.0, 47.0)), (12, (41.0, 45.0)), (13, (39.0, 43.0))]

The 15th couple is in the cluster 3, 16th in the cluster 2, 17th in the cluster 2. So the estimated number of the children for the 15th couple is 1. The estimated number of the children for the 16th and 17th couple is 5/3~1.67.

The error E5 of the estimation is:

E5=sqrt((1-1)2+(5/3-0)2+(5/3-3)2)~2.13

Using cross-validation to determine the outcome:

We used 14 couples as data for the estimation and 3 other couples for cross-validation to find the best parameter of k clusters among the values 2,3,4,5. We may try to cluster into more clusters, but since we have so relatively very little data, it should be sufficient to cluster into the 5 clusters at most. Let us summarize the errors of the estimation.

Number of clusters
Error rate
2
3.3
3
2.17
4
2.7
5
2.13

The error rate is the least for 3 and 5 clusters. The fact that the error rate goes up for 4 clusters and then down again for 5 clusters may indicate that we may not have enough data to make a good estimate. A natural expectation would be that there are not local maxims of errors for the values of k greater than 2. Moreover the difference between the error for clustering with 3 and 5 clusters is very small and one cluster out of 5 is smaller than one cluster out of 3. For this reason we choose 3 clusters over 5 to estimate the number of the children for the 18th couple.

When clustering into the 3 clusters, 18th couple is in the cluster 2. Therefore the estimated number of the children for the 18th couple is 1.25.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset