Test data evaluation

One of the things you must do with out-of-sample data is scale it according to the original (training) data. The predict() function that comes with the psych package lets you do this effortlessly. We put the scaled and scored values into a data frame that we can then use to make the out-of-sample predictions:

> test_reduced <- as.matrix(test[, c(-1, -95)])

> test_scores <- data.frame(predict(pca_5, test_reduced, old.data = train[, c(-1, -95)]))
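
Under the hood, supplying old.data should make predict() center and scale the new data with the training means and standard deviations before applying the component weights. The following is a minimal sketch of the equivalent manual calculation; it assumes pca_5 was built with psych::principal() and therefore carries a weights matrix, and the results should closely match test_scores:

> train_reduced <- train[, c(-1, -95)]

> test_scaled <- scale(test_reduced, center = colMeans(train_reduced),
    scale = apply(train_reduced, 2, sd))

> manual_scores <- test_scaled %*% pca_5$weights  # compare with head(test_scores)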

Next, we add the predicted and actual values to that data frame:

> test_scores$testpred <- predict(earth_fit, test_scores)

> test_scores$weight <- test$Weightlbs

The results look good:

> caret::postResample(pred = test_scores$testpred, 
obs = test_scores$weight)
  RMSE Rsquared    MAE 
7.8735   0.9468 5.1937
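
For reference, postResample() is simply computing the usual error summaries, with R-squared taken here as the squared correlation between predicted and observed values; a quick sketch of roughly the same calculation by hand:

> err <- test_scores$weight - test_scores$testpred

> c(RMSE = sqrt(mean(err^2)),
    Rsquared = cor(test_scores$testpred, test_scores$weight)^2,
    MAE = mean(abs(err)))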

The performance declined only slightly, so I think we can move forward with this model. The outliers deserve further exploration to determine whether they reflect measurement error and whether to drop them from the analysis or truncate them. In closing, let's look at the plot of actual versus predicted values:

> ggplot2::ggplot(test_scores, ggplot2::aes(x = testpred, y = weight)) +
ggplot2::geom_point() +
ggplot2::stat_smooth(method = "lm", se = FALSE) +
ggthemes::theme_excel_new()

The output of the preceding code is as follows:

It looks similar to the training data plot. Once again, there is at least one anomaly: how can our model predict a soldier at about 140 pounds when they actually weigh almost 300? We could amuse ourselves pursuing this further, but let's move on.
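
For readers who do want to chase it down, here is a quick sketch of how to pull the worst misses out of test_scores; the 100-pound residual cutoff is an arbitrary threshold chosen for illustration, not a value from our analysis:

> resid <- test_scores$weight - test_scores$testpred

> test_scores[abs(resid) > 100, c("testpred", "weight")]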
