Evaluating the performance of linear regression models

In the previous section, we learned how to fit a regression model on training data. However, as you learned in previous chapters, it is crucial to test the model on data that it hasn't seen during training to obtain an unbiased estimate of its performance.

As we remember from Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning, we want to split our dataset into separate training and test datasets, where we use the former to fit the model and the latter to evaluate how well it generalizes to unseen data. Instead of proceeding with the simple regression model, we will now use all variables in the dataset and train a multiple regression model:

>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LinearRegression
>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test = train_test_split(
...       X, y, test_size=0.3, random_state=0)
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> y_train_pred = slr.predict(X_train)
>>> y_test_pred = slr.predict(X_test)

Since our model uses multiple explanatory variables, we can't visualize the linear regression line (or hyperplane to be precise) in a two-dimensional plot, but we can plot the residuals (the differences or vertical distances between the actual and predicted values) versus the predicted values to diagnose our regression model. Residual plots are a commonly used graphical tool for diagnosing regression models. They can help detect nonlinearity and outliers, and check whether the errors are randomly distributed.

Using the following code, we will now plot a residual plot where we simply subtract the true target variables from our predicted responses:

>>> import matplotlib.pyplot as plt
>>> plt.scatter(y_train_pred, y_train_pred - y_train,
...             c='steelblue', marker='o', edgecolor='white',
...             label='Training data')
>>> plt.scatter(y_test_pred,  y_test_pred - y_test,
...             c='limegreen', marker='s', edgecolor='white',
...             label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, color='black', lw=2)
>>> plt.xlim([-10, 50])
>>> plt.show()

After executing the code, we should see a residual plot with a horizontal reference line at zero, as shown here:

[Figure: residual plot of the training and test data, with the predicted values on the x-axis and the residuals on the y-axis]

In the case of a perfect prediction, the residuals would be exactly zero, which we will probably never encounter in realistic and practical applications. However, for a good regression model, we would expect the errors to be randomly distributed and the residuals to be randomly scattered around the centerline. If we see patterns in a residual plot, it means that our model is unable to capture some explanatory information, which has leaked into the residuals, as a slight pattern in our previous residual plot suggests. Furthermore, we can also use residual plots to detect outliers, which are represented by points with a large deviation from the centerline.
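To make this outlier criterion concrete, the following sketch flags points whose residual deviates strongly from the centerline. The `flag_outliers` helper and its standard-deviation threshold are our own illustrative choices, not part of scikit-learn or this chapter's code:

```python
import numpy as np

# Hypothetical helper (not from scikit-learn): flag samples whose
# residual lies more than k standard deviations from the mean residual.
def flag_outliers(y_true, y_pred, k=3.0):
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    centered = residuals - residuals.mean()
    return np.abs(centered) > k * residuals.std()

# Toy data: the last sample has a much larger residual than the rest
y_true = np.array([3.0, 2.5, 4.0, 5.0, 3.5, 4.2, 2.8, 5.5, 4.8, 30.0])
y_pred = np.array([2.8, 2.7, 3.9, 5.2, 3.4, 4.0, 3.0, 5.3, 5.0, 5.0])
print(flag_outliers(y_true, y_pred, k=2.0))  # only the last point is flagged
```

In practice, the threshold `k` is a judgment call; points flagged this way are candidates for closer inspection, not automatic removal.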

Another useful quantitative measure of a model's performance is the so-called Mean Squared Error (MSE), which is simply the average of the SSE cost that we minimized to fit the linear regression model. The MSE is useful for comparing different regression models or for tuning their parameters via grid search and cross-validation, as it normalizes the SSE by the sample size:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$

Let's compute the MSE of our training and test predictions:

>>> from sklearn.metrics import mean_squared_error
>>> print('MSE train: %.3f, test: %.3f' % (
...     mean_squared_error(y_train, y_train_pred),
...     mean_squared_error(y_test, y_test_pred)))
MSE train: 19.958, test: 27.196

We see that the MSE on the training set is 19.96, and the MSE of the test set is much larger, with a value of 27.20, which is an indicator that our model is overfitting the training data.
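As a quick sanity check on the definition above, the MSE returned by scikit-learn is just the SSE divided by the number of samples; the arrays below are made-up toy numbers, not the housing data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up toy targets and predictions for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

sse = np.sum((y_true - y_pred) ** 2)  # sum of squared errors
mse_manual = sse / len(y_true)        # normalize by the sample size n
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)  # both 0.375
```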

Sometimes it may be more useful to report the coefficient of determination ($R^2$), which can be understood as a standardized version of the MSE, for better interpretability of the model's performance. In other words, $R^2$ is the fraction of response variance that is captured by the model. The $R^2$ value is defined as:

$$R^2 = 1 - \frac{SSE}{SST}$$

Here, SSE is the sum of squared errors and SST is the total sum of squares:

$$SST = \sum_{i=1}^{n}\left(y^{(i)} - \mu_y\right)^2$$

In other words, SST is simply the variance of the response, up to the scaling factor $n$.

Let us quickly show that $R^2$ is indeed just a rescaled version of the MSE:

$$R^2 = 1 - \frac{SSE}{SST}$$
$$= 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2}{\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \mu_y\right)^2}$$
$$= 1 - \frac{MSE}{Var(y)}$$

For the training dataset, $R^2$ is bounded between 0 and 1, but it can become negative for the test dataset. If $R^2 = 1$, the model fits the data perfectly with a corresponding $MSE = 0$.

Evaluated on the training data, the $R^2$ of our model is 0.765, which doesn't sound too bad. However, the $R^2$ on the test dataset is only 0.673, which we can compute by executing the following code:

>>> from sklearn.metrics import r2_score
>>> print('R^2 train: %.3f, test: %.3f' % 
...       (r2_score(y_train, y_train_pred),
...        r2_score(y_test, y_test_pred)))
R^2 train: 0.765, test: 0.673
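We can also verify numerically that $R^2 = 1 - MSE / Var(y)$, as derived above; the arrays below are made-up toy numbers, and note that the biased ($1/n$) variance estimator appears on both sides:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up toy targets and predictions for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
var_y = np.var(y_true)  # np.var uses the 1/n normalization by default
r2_manual = 1.0 - mse / var_y
print(r2_manual, r2_score(y_true, y_pred))  # the two values agree
```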