Best subsets

The following code is, for the most part, a rehash of what we developed in Chapter 2, Linear Regression - The Blocking and Tackling of Machine Learning. We will create the best subset object using the regsubsets() command and specify the train portion of the data. The variables that are selected will then be used in a model on the test set, which we will evaluate with a mean squared error calculation.
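If you are following along without the earlier setup in place, the following is a minimal sketch of the assumed objects. The prostate data frame with a logical train indicator is an assumption based on the split used earlier in this chapter, so adjust it to match your own setup; the library() call loads the leaps package, which provides regsubsets():

    > library(leaps)  # provides regsubsets()
    > # assumed: a 'prostate' data frame with a logical 'train' column
    > train <- subset(prostate, train == TRUE, select = -train)
    > test <- subset(prostate, train == FALSE, select = -train)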

The model that we are building is written out as lpsa ~ ., with the tilde and period indicating that we want to use all the remaining variables in our data frame as predictors, with the exception of the response:

    > subfit <- regsubsets(lpsa ~ ., data = train)

With the model built, you can produce the best subset with two lines of code. The first one saves the summary of the model as an object, from which we can extract the various subsets and determine the best one with the which.min() command. In this instance, I will use BIC, which was discussed in Chapter 2, Linear Regression - The Blocking and Tackling of Machine Learning:

    > b.sum <- summary(subfit)
    > which.min(b.sum$bic)
    [1] 3
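As an aside, once you know the best subset size, the coef() method that leaps supplies for regsubsets objects will extract that model's coefficient estimates directly; a quick sketch, with estimates that will vary by data split:

    > coef(subfit, 3)  # coefficients of the best three-variable model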

The output of which.min() is telling us that the model with three features has the lowest BIC value. A plot can be produced to examine the performance across the subset combinations, as follows:

    > plot(b.sum$bic, type = "l", xlab = "# of Features", ylab = "BIC",
        main = "BIC score by Feature Inclusion")

The following is the output of the preceding command:

A more detailed examination is possible by plotting the actual model object, as follows:

    > plot(subfit, scale = "bic", main = "Best Subset Features")

The output of the preceding command is:

So, the previous plot shows us that the three features in the lowest-BIC model are lcavol, lweight, and gleason. It is noteworthy that lcavol is included in every combination of the models, which is consistent with our earlier exploration of the data. We are now ready to try this model on the test portion of the data, but first we will produce a plot of the fitted values versus the actual values, looking for linearity in the solution and checking the constancy of the variance. A linear model will need to be created with just the three features of interest. Let's put this in an object called ols, for ordinary least squares. The fitted values from ols will then be compared to the actual values in the training set, as follows:

    > ols <- lm(lpsa ~ lcavol + lweight + gleason, data = train)
    > plot(ols$fitted.values, train$lpsa, xlab = "Predicted",
        ylab = "Actual", main = "Predicted vs Actual")

The following is the output of the preceding command:
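As an aside, when judging linearity in this kind of plot, it can help to add a 45-degree reference line after calling plot(); the points for a well-fitting model should cluster around it. A one-line sketch:

    > abline(0, 1, lty = 2)  # dashed y = x reference line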

An inspection of the plot shows that a linear fit should perform well on this data and that non-constant variance is not a problem. With that, we can see how the model performs on the test set data by utilizing the predict() function and specifying newdata = test, as follows:

    > pred.subfit <- predict(ols, newdata = test)
    > plot(pred.subfit, test$lpsa, xlab = "Predicted",
        ylab = "Actual", main = "Predicted vs Actual")

The resulting plot of the Predicted vs Actual values is shown in the following image:

The plot doesn't seem too terrible. For the most part, it is a linear fit, with the exception of what look to be two outliers at the high end of the PSA score. Before concluding this section, we will need to calculate the Mean Squared Error (MSE) to facilitate comparison across the various modeling techniques. This is easy enough: we just create the residuals and then take the mean of their squared values, as follows:

    > resid.subfit <- test$lpsa - pred.subfit
    > mean(resid.subfit^2)
    [1] 0.5084126
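As an optional check on the two apparent outliers noted above, you can sort the squared residuals and look at the largest contributors; a quick sketch, with values that will depend on your split:

    > head(sort(resid.subfit^2, decreasing = TRUE), 2)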

So, an MSE of 0.508 is our benchmark going forward.
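Since we will compute this same quantity for each technique in the chapter, you may find it convenient to wrap the calculation in a small helper function. This is a sketch, and the mse() name is my own rather than anything from the preceding code:

    > mse <- function(actual, predicted) mean((actual - predicted)^2)
    > mse(test$lpsa, pred.subfit)
    [1] 0.5084126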
