Predicting house prices with regression

In every example we have seen so far, we have faced what in Chapter 1, Machine Learning – A Gentle Introduction, we called classification problems: the output we aimed to predict belonged to a discrete set. But often we want to predict a value drawn from the real line. The learning schema is still the same: fit a model to the training data and evaluate it on new data, except that now the target is a real number. Our predictor, instead of selecting a class from a list, should act as a real-valued function that, for each of the (possibly infinite) combinations of learning features, returns a real number. We could consider regression as classification with an infinite number of target classes.

Many problems can be modeled both as classification and regression tasks, depending on the variable we select as the target. For example, predicting blood sugar level is a regression task, while predicting whether somebody has diabetes or not is a classification task.

In the example of the first figure, we used a line to fit the learning data (composed of a single attribute and a target value); that is, we performed linear regression. If we want to predict the value of a new instance, we take its real-valued attribute and obtain the predicted value by projecting the inferred line onto the second axis.
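To make this concrete, here is a minimal sketch of that idea on a handful of made-up points (the values are invented for illustration), fitting the line with NumPy:

>>> import numpy as np
>>> x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # a single learning attribute
>>> y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])    # the real-valued target
>>> slope, intercept = np.polyfit(x, y, 1)     # least squares fit of a degree-1 polynomial (a line)
>>> print slope * 6.0 + intercept              # the predicted target for a new instance with attribute 6.0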

In this section, we will compare several regression methods by using the same dataset. We will try to predict the price of a house as a function of its attributes. As the dataset, we will use the Boston house-prices dataset, which includes 506 instances representing houses in the suburbs of Boston, each described by 13 real-valued features and the target value: MEDV, the median value of owner-occupied homes, in thousands of dollars (for a detailed reference, see http://archive.ics.uci.edu/ml/datasets/Housing).

The dataset is included in the standard scikit-learn distribution, so let's start by loading it:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn.datasets import load_boston
>>> boston = load_boston()
>>> print boston.data.shape
(506, 13)
>>> print boston.feature_names
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
>>> print np.max(boston.target), np.min(boston.target), np.mean(boston.target)
50.0 5.0 22.5328063241

You should try printing boston.DESCR to get a feel for what each feature means. This is a very healthy habit: machine learning is not just number crunching; understanding the problem we are facing is crucial, especially for selecting the best learning model to use.

As usual, we start by slicing our learning set into training and testing datasets and normalizing the data:

>>> from sklearn.cross_validation import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=33)
>>> from sklearn.preprocessing import StandardScaler
>>> scalerX = StandardScaler().fit(X_train)
>>> scalery = StandardScaler().fit(y_train)
>>> X_train = scalerX.transform(X_train)
>>> y_train = scalery.transform(y_train)
>>> X_test = scalerX.transform(X_test)
>>> y_test = scalery.transform(y_test)
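
StandardScaler learns the mean and standard deviation of each feature on the training set and then standardizes the data to zero mean and unit variance. If you want to convince yourself, this quick (optional) check should show per-feature means close to 0 and standard deviations close to 1 on the training data:

>>> print X_train.mean(axis=0)[:3]   # per-feature means after scaling (approximately zero)
>>> print X_train.std(axis=0)[:3]    # per-feature standard deviations (approximately one)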

Before looking for our best model, let's define how we will compare results. Since we want to preserve our testing set for evaluating the performance of the final model, we should find a way to select the best model while avoiding overfitting. We already know the answer: cross-validation. Regression poses an additional problem: how should we evaluate our results? Accuracy is not a good measure, because we are predicting real values and it is almost impossible to predict exactly the final value. There are several measures that can be used (you can look at the list of functions under the sklearn.metrics module). The most common is the R2 score, or coefficient of determination, which measures the proportion of the outcome variation explained by the model and is the default score function for regression methods in scikit-learn. This score reaches its maximum value of 1 when the model perfectly predicts all the test target values. Using this measure, we will build a function that trains a model and evaluates its performance using five-fold cross-validation and the coefficient of determination.

>>> from sklearn.cross_validation import *
>>> def train_and_evaluate(clf, X_train, y_train):
>>>     clf.fit(X_train, y_train)
>>>     print "Coefficient of determination on training 
        set:",clf.score(X_train, y_train)
>>>     # create a k-fold cross validation iterator of k=5 folds
>>>     cv = KFold(X_train.shape[0], 5, shuffle=True, 
        random_state=33)
>>>     scores = cross_val_score(clf, X_train, y_train, cv=cv)
>>>     print "Average coefficient of determination using 5-fold 
        crossvalidation:",np.mean(scores)

First try – a linear model

The question that linear models try to answer is which hyperplane in the 14-dimensional space created by our learning features (including the target value) is located closest to the training instances. After this hyperplane is found, prediction reduces to calculating the projection of the new point onto the hyperplane and returning the target value coordinate. Think of our first example in Chapter 1, Machine Learning – A Gentle Introduction, where we wanted to find a line separating our training instances. We could have used that line to predict the second learning attribute as a function of the first one; that is, linear regression.

But what do we mean by closest? The usual measure is least squares: calculate the distance of each instance to the hyperplane, square it (to avoid sign problems), and sum them up. The hyperplane whose sum is smallest is the least squares estimator (in two dimensions, the hyperplane is simply a line).
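
By the way, the least squares coefficients can be computed directly with NumPy; the following is just a sketch to make the criterion concrete, not the method we will use next (since we standardized both the features and the target, we can get away without an intercept term):

>>> # find the w that minimizes ||X_train * w - y_train||^2
>>> w, residuals, rank, sv = np.linalg.lstsq(X_train, y_train)
>>> print w   # one coefficient per feature; roughly comparable to the coef_ attribute of the models below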

Since we don't know how our data fits (it is difficult to print a 14-dimensional scatter plot!), we will start with a linear model called SGDRegressor, which tries to minimize the squared loss.

>>> from sklearn import linear_model
>>> clf_sgd = linear_model.SGDRegressor(loss='squared_loss', 
    penalty=None,  random_state=42)
>>> train_and_evaluate(clf_sgd,X_train,y_train)
Coefficient of determination on training set: 0.743303511411
Average coefficient of determination using 5-fold cross-validation: 0.715166411086

We can print the coefficients of the hyperplane that our method has calculated, as follows:

>>> print clf_sgd.coef_
[-0.07641527  0.06963738 -0.05935062  0.10878438 -0.06356188  0.37260998 -0.02912886 -0.20180631  0.08463607 -0.05534634 
-0.19521922 0.0653966 -0.36990842]

You probably noticed the penalty=None parameter when we called the method. The penalization parameter for linear regression methods is introduced to avoid overfitting. It does this by penalizing hyperplanes that have some of their coefficients too large, seeking hyperplanes where each feature contributes more or less the same to the predicted value. The penalty is generally based on the L2 norm (the squared sum of the coefficients) or the L1 norm (the sum of the absolute values of the coefficients). Let's see how our model works if we introduce an L2 penalty.

>>> clf_sgd1 = linear_model.SGDRegressor(loss='squared_loss', 
    penalty='l2',  random_state=42)
>>> train_and_evaluate(clf_sgd1, X_train, y_train) 
Coefficient of determination on training set: 0.743300616394
Average coefficient of determination using 5-fold cross-validation: 0.715166962417

In this case, we did not obtain an improvement.
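
If you are curious, the L1 penalty (or a combination of both, with penalty='elasticnet') can be tried in exactly the same way; the following is just a sketch of the call, and we make no claims about how it will score on this particular split:

>>> clf_sgd2 = linear_model.SGDRegressor(loss='squared_loss', 
    penalty='l1', random_state=42)
>>> train_and_evaluate(clf_sgd2, X_train, y_train)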

Second try – Support Vector Machines for regression

The regression version of SVM can be used instead to find the hyperplane.

>>> from sklearn import svm
>>> clf_svr = svm.SVR(kernel='linear')
>>> train_and_evaluate(clf_svr, X_train, y_train)
Coefficient of determination on training set: 0.71886923342
Average coefficient of determination using 5-fold cross-validation: 0.694983285734

Here, we had no improvement. However, one of the main advantages of SVM is that (using what we called the kernel trick) we can use a nonlinear function, for example, a polynomial function to approximate our data.

>>> clf_svr_poly = svm.SVR(kernel='poly')
>>> train_and_evaluate(clf_svr_poly, X_train, y_train)
Coefficient of determination on training set: 0.904109273301
Average coefficient of determination using 5-fold cross-validation: 0.754993478137

Now, our results are six points better in terms of coefficient of determination. We can actually improve this by using a Radial Basis Function (RBF) kernel.

>>> clf_svr_rbf = svm.SVR(kernel='rbf')
>>> train_and_evaluate(clf_svr_rbf, X_train, y_train)
Coefficient of determination on training set: 0.900132065979
Average coefficient of determination using 5-fold cross-validation: 0.821626135903

RBF kernels have been used in several problems and have been shown to be very effective. Actually, RBF is the default kernel used by SVM methods in scikit-learn.
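
You can verify this yourself: an SVR instance created without arguments stores 'rbf' as its kernel parameter, so calling svm.SVR() with no arguments would have given us the same model:

>>> print svm.SVR().kernel   # 'rbf' is the default, so svm.SVR() is equivalent to svm.SVR(kernel='rbf')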

Third try – Random Forests revisited

We can try a very different approach to regression using Random Forests. We have previously used Random Forests for classification. When they are used for regression, the tree-growing procedure is almost exactly the same, but at prediction time, when we arrive at a leaf, instead of reporting the majority class, we return a representative real value, for example, the average of the target values in that leaf.
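
To see what this means in practice, here is a tiny sketch on made-up numbers using a single regression tree (the building block of the forest): the prediction for a new instance is the average of the training targets that ended up in the same leaf.

>>> from sklearn.tree import DecisionTreeRegressor
>>> X_toy = [[1.0], [2.0], [10.0], [12.0]]
>>> y_toy = [1.0, 2.0, 10.0, 12.0]
>>> toy_tree = DecisionTreeRegressor(max_depth=1).fit(X_toy, y_toy)
>>> print toy_tree.predict([[11.0]])   # falls into the right-hand leaf: (10.0 + 12.0) / 2 = 11.0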

Actually, we will use Extra Trees, implemented in the ExtraTreesRegressor class within the sklearn.ensemble module. This method adds an extra level of randomization: when splitting a node, it not only considers a random subset of the features, but also draws the candidate splitting threshold for each of them at random.

>>> from sklearn import ensemble
>>> clf_et=ensemble.ExtraTreesRegressor(n_estimators=10, 
    compute_importances=True, random_state=42)
>>> train_and_evaluate(clf_et, X_train, y_train)
Coefficient of determination on training set: 1.0
Average coefficient of determination using 5-fold cross-validation: 0.852511952001

The first thing to note is that we have not only completely eliminated underfitting (achieving perfect prediction on the training values), but have also improved performance by another three points in cross-validation. An interesting feature of Extra Trees is that they allow computing the importance of each feature for the regression task. Let's compute this importance as follows:

>>> print np.sort(zip(clf_et.feature_importances_, boston.feature_names), axis=0)

[['0.000231085384564' 'AGE']
 ['0.000909210196652' 'B']
 ['0.00162702734638' 'CHAS']
 ['0.00292361527201' 'CRIM']
 ['0.00472492264278' 'DIS']
 ['0.00489022243822' 'INDUS']
 ['0.0067481487587' 'LSTAT']
 ['0.00852353178943' 'NOX']
 ['0.00873406149286' 'PTRATIO']
 ['0.0366902590312' 'RAD']
 ['0.0982265323415' 'RM']
 ['0.385904111089' 'TAX']
 ['0.439867272217' 'ZN']]

We can see that ZN (proportion of residential land zoned for lots over 25,000 sq. ft.) and TAX (full-value property tax rate) are by far the most influential features on our final decision.

Evaluation

As usual, let's evaluate the performance of our best method on the testing set (we have slightly modified the measure_performance function from the previous sections so that it can also show the coefficient of determination):

>>> from sklearn import metrics
>>> def measure_performance(X, y, clf, show_accuracy=True, 
    show_classification_report=True, show_confusion_matrix=True, 
    show_r2_score=False):
>>>     y_pred = clf.predict(X)
>>>     if show_accuracy:
>>>         print "Accuracy:{0:.3f}".format(
>>>             metrics.accuracy_score(y, y_pred)
>>>         ), "\n"
>>>
>>>     if show_classification_report:
>>>         print "Classification report"
>>>         print metrics.classification_report(y, y_pred), "\n"
>>>
>>>     if show_confusion_matrix:
>>>         print "Confusion matrix"
>>>         print metrics.confusion_matrix(y, y_pred), "\n"
>>>
>>>     if show_r2_score:
>>>         print "Coefficient of determination:{0:.3f}".format(
>>>             metrics.r2_score(y, y_pred)
>>>         ), "\n"
        
>>> measure_performance(X_test, y_test, clf_et, 
    show_accuracy=False, show_classification_report=False,
    show_confusion_matrix=False, show_r2_score=True)
Coefficient of determination:0.793
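
Remember that we standardized the target, so the values returned by clf_et.predict are in the scaled space rather than actual prices. If you want predictions in the original units (thousands of dollars), you can undo the scaling with the scaler we fitted on the target; a minimal sketch:

>>> y_test_pred = clf_et.predict(X_test)
>>> print scalery.inverse_transform(y_test_pred)[:5]   # the first five predictions, back in the original units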

Once we have selected our best method, we could be tempted to retrain it on all the available data; but then we would have no way to measure its performance on future data, simply because we would not have any data left to test on.
