For each person, we are given their age, yearly income, and whether they own a house or not:
Age | Annual income in USD | House ownership status |
23 | 50,000 | Non-owner |
37 | 34,000 | Non-owner |
48 | 40,000 | Owner |
52 | 30,000 | Non-owner |
28 | 95,000 | Owner |
25 | 78,000 | Non-owner |
35 | 130,000 | Owner |
32 | 105,000 | Owner |
20 | 100,000 | Non-owner |
40 | 60,000 | Owner |
50 | 80,000 | Peter |
The aim is to predict whether Peter, aged 50, with an income of $80k/year, owns a house and could be a potential customer for our insurance company.
Analysis:
In this case, we could try to apply the 1-NN algorithm. However, we should be careful about how we measure the distances between the data points, since the income range is much wider than the age range. Income levels of $115k and $116k are $1,000 apart, so in unscaled units two data points with these incomes would be very far from each other, far further than any age difference in the table could place them. Yet, relative to each other, the two incomes hardly differ. Because we consider both measures (age and yearly income) to be about equally important, we scale both to the range from 0 to 1 according to the formula:
ScaledQuantity = (ActualQuantity-MinQuantity)/(MaxQuantity-MinQuantity)
In our particular case, this reduces to:
ScaledAge = (ActualAge-MinAge)/(MaxAge-MinAge)
ScaledIncome = (ActualIncome-MinIncome)/(MaxIncome-MinIncome)
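As a minimal sketch of this scaling step, the following Python snippet applies the formula to the ages and incomes from the table. The helper name `min_max_scale` and the variable names are our own choices for illustration. Peter's values are included in the lists; they lie inside the observed ranges (ages 20 to 52, incomes $30,000 to $130,000), so they do not change the minimum or the maximum.

```python
# Ages and annual incomes (USD) from the table; the last entry is Peter.
ages = [23, 37, 48, 52, 28, 25, 35, 32, 20, 40, 50]
incomes = [50_000, 34_000, 40_000, 30_000, 95_000, 78_000,
           130_000, 105_000, 100_000, 60_000, 80_000]


def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)

# Print each person as: age, scaled age, income, scaled income.
for age, s_age, income, s_income in zip(ages, scaled_ages, incomes, scaled_incomes):
    print(age, s_age, income, s_income)
```

For example, the age 23 becomes (23-20)/(52-20) = 0.09375, and the income 50,000 becomes (50,000-30,000)/(130,000-30,000) = 0.2.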
After scaling, we get the following data:
Age | Scaled age | Annual income in USD | Scaled annual income | House ownership status |
23 | 0.09375 | 50,000 | 0.2 | Non-owner |
37 | 0.53125 | 34,000 | 0.04 | Non-owner |
48 | 0.875 | 40,000 | 0.1 | Owner |
52 | 1 | 30,000 | 0 | Non-owner |
28 | 0.25 | 95,000 | 0.65 | Owner |
25 | 0.15625 | 78,000 | 0.48 | Non-owner |
35 | 0.46875 | 130,000 | 1 | Owner |
32 | 0.375 | 105,000 | 0.75 | Owner |
20 | 0 | 100,000 | 0.7 | Non-owner |
40 | 0.625 | 60,000 | 0.3 | Owner |
50 | 0.9375 | 80,000 | 0.5 | ? |
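Now, if we apply the 1-NN algorithm with the Euclidean metric to the scaled data, we find that Peter's nearest neighbor is the 40-year-old owner with an income of $60,000, at a distance of about 0.37, so Peter more than likely owns a house. Note that, without rescaling, the algorithm would yield a different result. Refer to exercise 1.5.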
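The effect of rescaling can be checked with a short Python sketch. It is only an illustration under our own assumptions: the names `people`, `peter`, `scale`, and `nearest_label` are hypothetical, and the minimum and maximum are taken over the ten people with a known status (Peter's values fall inside those ranges).

```python
import math

# (age, annual income in USD, house ownership status) for the ten known people.
people = [
    (23, 50_000, "Non-owner"), (37, 34_000, "Non-owner"),
    (48, 40_000, "Owner"), (52, 30_000, "Non-owner"),
    (28, 95_000, "Owner"), (25, 78_000, "Non-owner"),
    (35, 130_000, "Owner"), (32, 105_000, "Owner"),
    (20, 100_000, "Non-owner"), (40, 60_000, "Owner"),
]
peter = (50, 80_000)


def nearest_label(points, labels, query):
    """1-NN: return the label of the point closest to query (Euclidean metric)."""
    distances = [math.hypot(p[0] - query[0], p[1] - query[1]) for p in points]
    return labels[distances.index(min(distances))]


points = [(age, income) for age, income, _ in people]
labels = [status for _, _, status in people]

# Minimum and maximum of each feature, used for scaling to the range 0..1.
min_age, max_age = min(p[0] for p in points), max(p[0] for p in points)
min_inc, max_inc = min(p[1] for p in points), max(p[1] for p in points)


def scale(point):
    """Apply the min-max formula to an (age, income) pair."""
    return ((point[0] - min_age) / (max_age - min_age),
            (point[1] - min_inc) / (max_inc - min_inc))


scaled_result = nearest_label([scale(p) for p in points], labels, scale(peter))
unscaled_result = nearest_label(points, labels, peter)

print("With scaling:   ", scaled_result)    # Owner
print("Without scaling:", unscaled_result)  # Non-owner
```

Without scaling, the distance is dominated by income, so the nearest neighbor is the 25-year-old non-owner earning $78,000, even though that person is 25 years younger than Peter.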