Let's use some different housing data from the City of Saratoga, CA. This time, we will look at the lot size and house price:
Lot size | House price (in $1,000) |
12,839 | 2,405 |
10,000 | 2,200 |
8,040 | 1,400 |
13,104 | 1,800 |
10,000 | 2,351 |
3,049 | 795 |
38,768 | 2,725 |
16,250 | 2,150 |
43,026 | 2,724 |
44,431 | 2,675 |
40,000 | 2,930 |
1,260 | 870 |
15,000 | 2,210 |
10,032 | 1,145 |
12,420 | 2,419 |
69,696 | 2,750 |
12,600 | 2,035 |
10,240 | 1,150 |
876 | 665 |
8,125 | 1,430 |
11,792 | 1,920 |
1,512 | 1,230 |
1,276 | 975 |
67,518 | 2,400 |
9,810 | 1,725 |
6,324 | 2,300 |
12,510 | 1,700 |
15,616 | 1,915 |
15,476 | 2,278 |
13,390 | 2,497.5 |
1,158 | 725 |
2,000 | 870 |
2,614 | 730 |
13,433 | 2,050 |
12,500 | 3,330 |
15,750 | 1,120 |
13,996 | 4,100 |
10,450 | 1,655 |
7,500 | 1,550 |
12,125 | 2,100 |
14,500 | 2,100 |
10,000 | 1,175 |
10,019 | 2,047.5 |
48,787 | 3,998 |
53,579 | 2,688 |
10,788 | 2,251 |
11,865 | 1,906 |
Let's convert this data into a comma-separated values (CSV) file called saratoga.csv and draw it as a scatter plot.
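One possible way to do both steps is sketched below in Python. The rows are copied from the table above; the header names `lot_size` and `price_thousands`, the helper names `write_csv` and `plot_scatter`, and the use of matplotlib for plotting are assumptions made for illustration, not something the text prescribes.

```python
import csv

# (lot size, house price in $1,000) pairs copied from the table above
SARATOGA = [
    (12839, 2405), (10000, 2200), (8040, 1400), (13104, 1800),
    (10000, 2351), (3049, 795), (38768, 2725), (16250, 2150),
    (43026, 2724), (44431, 2675), (40000, 2930), (1260, 870),
    (15000, 2210), (10032, 1145), (12420, 2419), (69696, 2750),
    (12600, 2035), (10240, 1150), (876, 665), (8125, 1430),
    (11792, 1920), (1512, 1230), (1276, 975), (67518, 2400),
    (9810, 1725), (6324, 2300), (12510, 1700), (15616, 1915),
    (15476, 2278), (13390, 2497.5), (1158, 725), (2000, 870),
    (2614, 730), (13433, 2050), (12500, 3330), (15750, 1120),
    (13996, 4100), (10450, 1655), (7500, 1550), (12125, 2100),
    (14500, 2100), (10000, 1175), (10019, 2047.5), (48787, 3998),
    (53579, 2688), (10788, 2251), (11865, 1906),
]


def write_csv(path="saratoga.csv", rows=SARATOGA):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # Header names are an assumption; the text does not specify any.
        writer.writerow(["lot_size", "price_thousands"])
        writer.writerows(rows)


def plot_scatter(rows=SARATOGA):
    """Draw lot size against price; requires the third-party matplotlib package."""
    import matplotlib.pyplot as plt  # imported lazily so write_csv works without it

    lot_sizes = [lot for lot, _ in rows]
    prices = [price for _, price in rows]
    plt.scatter(lot_sizes, prices)
    plt.xlabel("Lot size")
    plt.ylabel("House price (in $1,000)")
    plt.show()
```

Calling `write_csv()` creates saratoga.csv in the current directory, and `plot_scatter()` opens the plot in a window if matplotlib is installed.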
Finding the number of clusters is a tricky task. Here, we have the advantage of visual inspection, which is not available for data with more than three dimensions. Judging by eye, let's roughly divide the data into four clusters.
Now run the k-means algorithm with four clusters on the same data and see how close its results come to this visual grouping.
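As a rough illustration of what this step involves, here is a minimal plain-Python sketch of the standard k-means iteration (Lloyd's algorithm). It is not necessarily the implementation or library this text goes on to use. The function names, the `iterations` and `seed` parameters, and the assumption that saratoga.csv has one header row followed by two numeric columns are all hypothetical choices for this example.

```python
import csv
import random


def load_points(path):
    """Read (lot size, price) pairs; assumes one header row and two numeric columns."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the assumed header row
        return [(float(lot), float(price)) for lot, price in reader]


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Return (centroids, labels) after at most `iterations` update rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = list(centroids)
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                updated[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if updated == centroids:  # no centroid moved, so the result is stable
            break
        centroids = updated
    # Final assignment so the labels match the returned centroids.
    return centroids, assign(points, centroids)
```

With the CSV in place, `centroids, labels = kmeans(load_points("saratoga.csv"), k=4)` gives four cluster centers and a cluster index for every house. Note that lot sizes span a far larger numeric range than prices, so unscaled Euclidean distance is driven mostly by lot size. The result also depends on the random starting points, so different seeds can produce different groupings.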