Let's look at three features of the housing data for the City of Saratoga, CA: house size, lot size, and price. Using PCA, we will merge the house size and lot size features into a single feature, z. Let's call z the density of a house.
It is worth noting that it is not always possible to give meaning to the new feature created. In this case, it is easy as we have only two features to combine and we can use common sense to combine the effect of the two. In a more practical case, you may have 1,000 features that you are trying to project to 100 features. It may not be possible to give real-life meaning to each one of those 100 features.
In this exercise, we will derive the housing density using PCA and then we will do linear regression to see how this density affects the house price.
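To make the idea of merging two features into one concrete, here is a minimal pure-Python sketch of projecting two-dimensional points onto their first principal component. The function name `first_principal_component` and the closed-form angle formula for a 2x2 covariance matrix are choices made for this illustration, not the recipe's own implementation. The sample points are a handful of scaled (house size, lot size) pairs taken from the data table later in this recipe.

```python
import math


def first_principal_component(points):
    """Return (direction, z): the unit direction of greatest variance
    and the 1-D projection of each centered point onto it."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # For a symmetric 2x2 matrix, the major axis lies at this angle.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))

    # Project every centered point onto the principal direction.
    z = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, z


# A few (scaled house size, scaled lot size) pairs from the table.
sample = [(-0.025, -0.231), (0.323, -0.4), (0.142, 1.312),
          (0.949, 1.566), (-1.414, -0.943), (3.887, 1.909)]
direction, density = first_principal_component(sample)
print("direction:", direction)
print("z (density):", density)
```

Each value in `density` plays the role of z for one house. With only six points the direction is illustrative; the full dataset would give a different one.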
There is a preprocessing stage before we delve into PCA: feature scaling. Feature scaling comes into the picture when two features have ranges that are at different scales. Here, house size ranges from about 800 sq. ft. to 7,000 sq. ft., while lot size ranges from 800 sq. ft. to a few acres (one acre is 43,560 sq. ft.).
Why did we not have to do feature scaling before? Until now, we did not need to put features on a level playing field. PCA, however, looks for the directions of greatest variance, so a feature with a much larger range, such as lot size here, would dominate the result if left unscaled. Gradient descent is another area where feature scaling is very useful.
There are different ways of doing feature scaling:
- Dividing a feature value by the maximum value, which puts every feature in the -1 ≤ x ≤ 1 range
- Dividing a feature value by the range, that is, the maximum value minus the minimum value
- Subtracting the mean from a feature value and then dividing by the range
- Subtracting the mean from a feature value and then dividing by the standard deviation
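The four options can be sketched in a few lines of Python. The function names are invented for this illustration, and two details are assumptions: the first option divides by the largest absolute value (so that negative inputs also land in the -1 to 1 range), and the last one uses the sample standard deviation. The input is the house size from the first five rows of the data table.

```python
from statistics import mean, stdev


def scale_by_max(values):
    # Option 1: divide by the largest absolute value (assumption).
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]


def scale_by_range(values):
    # Option 2: divide by (maximum - minimum).
    value_range = max(values) - min(values)
    return [v / value_range for v in values]


def mean_normalize(values):
    # Option 3: subtract the mean, then divide by the range.
    mu = mean(values)
    value_range = max(values) - min(values)
    return [(v - mu) / value_range for v in values]


def standardize(values):
    # Option 4: subtract the mean, then divide by the standard
    # deviation (sample standard deviation is an assumption here).
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


house_sizes = [2524, 2937, 1778, 1242, 2900]  # first five rows only
for scaler in (scale_by_max, scale_by_range, mean_normalize, standardize):
    print(scaler.__name__, [round(v, 3) for v in scaler(house_sizes)])
```

Because only five rows are used, the printed numbers will not match the scaled columns in the table, which are computed over the whole dataset.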
We are going to use the fourth choice, which centers each feature at zero and gives it a standard deviation of one. The following is the data we are going to use for this recipe:
House size (sq. ft.) | Lot size (sq. ft.) | Scaled house size | Scaled lot size | House price (in $1,000) |
2,524 | 12,839 | -0.025 | -0.231 | 2,405 |
2,937 | 10,000 | 0.323 | -0.4 | 2,200 |
1,778 | 8,040 | -0.654 | -0.517 | 1,400 |
1,242 | 13,104 | -1.105 | -0.215 | 1,800 |
2,900 | 10,000 | 0.291 | -0.4 | 2,351 |
1,218 | 3,049 | -1.126 | -0.814 | 795 |
2,722 | 38,768 | 0.142 | 1.312 | 2,725 |
2,553 | 16,250 | -0.001 | -0.028 | 2,150 |
3,681 | 43,026 | 0.949 | 1.566 | 2,724 |
3,032 | 44,431 | 0.403 | 1.649 | 2,675 |
3,437 | 40,000 | 0.744 | 1.385 | 2,930 |
1,680 | 1,260 | -0.736 | -0.92 | 870 |
2,260 | 15,000 | -0.248 | -0.103 | 2,210 |
1,660 | 10,032 | -0.753 | -0.398 | 1,145 |
3,251 | 12,420 | 0.587 | -0.256 | 2,419 |
3,039 | 69,696 | 0.409 | 3.153 | 2,750 |
3,401 | 12,600 | 0.714 | -0.245 | 2,035 |
1,620 | 10,240 | -0.787 | -0.386 | 1,150 |
876 | 876 | -1.414 | -0.943 | 665 |
1,889 | 8,125 | -0.56 | -0.512 | 1,430 |
4,406 | 11,792 | 1.56 | -0.294 | 1,920 |
1,885 | 1,512 | -0.564 | -0.905 | 1,230 |
1,276 | 1,276 | -1.077 | -0.92 | 975 |
3,053 | 67,518 | 0.42 | 3.023 | 2,400 |
2,323 | 9,810 | -0.195 | -0.412 | 1,725 |
3,139 | 6,324 | 0.493 | -0.619 | 2,300 |
2,293 | 12,510 | -0.22 | -0.251 | 1,700 |
2,635 | 15,616 | 0.068 | -0.066 | 1,915 |
2,298 | 15,476 | -0.216 | -0.074 | 2,278 |
2,656 | 13,390 | 0.086 | -0.198 | 2,497.5 |
1,158 | 1,158 | -1.176 | -0.927 | 725 |
1,511 | 2,000 | -0.879 | -0.876 | 870 |
1,252 | 2,614 | -1.097 | -0.84 | 730 |
2,141 | 13,433 | -0.348 | -0.196 | 2,050 |
3,565 | 12,500 | 0.852 | -0.251 | 3,330 |
1,368 | 15,750 | -0.999 | -0.058 | 1,120 |
5,726 | 13,996 | 2.672 | -0.162 | 4,100 |
2,563 | 10,450 | 0.008 | -0.373 | 1,655 |
1,551 | 7,500 | -0.845 | -0.549 | 1,550 |
1,993 | 12,125 | -0.473 | -0.274 | 2,100 |
2,555 | 14,500 | 0.001 | -0.132 | 2,100 |
1,572 | 10,000 | -0.827 | -0.4 | 1,175 |
2,764 | 10,019 | 0.177 | -0.399 | 2,047.5 |
7,168 | 48,787 | 3.887 | 1.909 | 3,998 |
4,392 | 53,579 | 1.548 | 2.194 | 2,688 |
3,096 | 10,788 | 0.457 | -0.353 | 2,251 |
2,003 | 11,865 | -0.464 | -0.289 | 1,906 |
Let's take the scaled house size and scaled lot size data and save it as scaledhousedata.csv.
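One way to produce this file is sketched below. It assumes the file should hold the two scaled columns from the table (scaled house size and scaled lot size), one row per house, with no header line. The names `HOUSES`, `scaled_rows`, and `write_scaled_csv` are hypothetical, and the sample standard deviation is an assumption, so the rounded output should be close to the table but may differ in the last digit.

```python
import csv
from statistics import mean, stdev

# (house size, lot size) in sq. ft., copied from the table above.
HOUSES = [
    (2524, 12839), (2937, 10000), (1778, 8040), (1242, 13104),
    (2900, 10000), (1218, 3049), (2722, 38768), (2553, 16250),
    (3681, 43026), (3032, 44431), (3437, 40000), (1680, 1260),
    (2260, 15000), (1660, 10032), (3251, 12420), (3039, 69696),
    (3401, 12600), (1620, 10240), (876, 876), (1889, 8125),
    (4406, 11792), (1885, 1512), (1276, 1276), (3053, 67518),
    (2323, 9810), (3139, 6324), (2293, 12510), (2635, 15616),
    (2298, 15476), (2656, 13390), (1158, 1158), (1511, 2000),
    (1252, 2614), (2141, 13433), (3565, 12500), (1368, 15750),
    (5726, 13996), (2563, 10450), (1551, 7500), (1993, 12125),
    (2555, 14500), (1572, 10000), (2764, 10019), (7168, 48787),
    (4392, 53579), (3096, 10788), (2003, 11865),
]


def standardize(values):
    # Subtract the mean, then divide by the sample standard deviation
    # (the sample form is an assumption; the text does not specify it).
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


def scaled_rows(houses=HOUSES):
    # Scale each column independently and round to three decimals.
    house = standardize([h for h, _ in houses])
    lot = standardize([l for _, l in houses])
    return [(round(h, 3), round(l, 3)) for h, l in zip(house, lot)]


def write_scaled_csv(path, houses=HOUSES):
    # Write one "scaled house size,scaled lot size" line per house.
    rows = scaled_rows(houses)
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows
```

Calling `write_scaled_csv("scaledhousedata.csv")` writes the 47 scaled rows to the current directory and returns them as a list of tuples.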