Validating model assumptions

Now that we have our estimates, it is time to validate the assumptions of absence of autocorrelation and homoscedasticity. As usual, there is a package for that: car, by Professor John Fox. This package provides a wide range of diagnostic tests for linear models. We are going to employ the durbinWatsonTest() and ncvTest() functions to check, respectively, the absence of autocorrelation and the homoscedasticity assumption.

For both of them, all you need to do is call the function, passing the regression model object as an argument. Let us start with the Breusch-Pagan test, which checks whether the variance of our residuals is substantially constant:

ncvTest(linear_regression_economic_sector)

Non-constant Variance Score Test
 Variance formula: ~ fitted.values
 Chisquare = 9.532657 Df = 1 p = 0.002018477

What do you say? Is it a good output? Unfortunately, it is not. The null hypothesis of this test is that the residuals have a constant variance, and the p-value, the one you find after p on the last line, is definitely lower than the 0.05 threshold. We must, therefore, reject the null hypothesis of constant variance and conclude that our residuals show signs of heteroscedasticity: the NCV test has not passed.
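To make the mechanics of the test less of a black box, here is a minimal sketch of the score statistic computed by hand in base R. It runs on simulated data: x, y, and model are hypothetical names, not objects from our dataset. It assumes the default variance formula ~ fitted.values. Under these assumptions, the result should correspond to what ncvTest() reports for the same model:

```r
# Simulated data (hypothetical): the error spread grows with x,
# so the constant variance assumption is violated by construction
set.seed(42)
n <- 200
x <- runif(n, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
model <- lm(y ~ x)

# Squared residuals scaled by the maximum likelihood variance estimate
e <- residuals(model)
sigma2_ml <- sum(e^2) / n
u <- e^2 / sigma2_ml

# Auxiliary regression of the scaled squared residuals on the fitted values
auxiliary <- lm(u ~ fitted(model))

# Score statistic: half of the regression sum of squares,
# compared with a chi-square distribution with 1 degree of freedom
regression_ss <- sum((fitted(auxiliary) - mean(u))^2)
chisq <- regression_ss / 2
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

cat("Chisquare =", chisq, "Df = 1 p =", p_value, "\n")
```

A small p-value counts as evidence against constant variance, which is what we expect here given how the data was simulated.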

Now, it's time to move on to the Durbin-Watson test:

durbinWatsonTest(linear_regression_economic_sector)

lag Autocorrelation D-W Statistic p-value
 1 0.9958624 0.007780056 0
 Alternative hypothesis: rho != 0

The most relevant number here is the D-W statistic, which is a very low 0.007780056. If you remember, this statistic ranges from 0 to 4: values close to 2 indicate no autocorrelation, while values close to 0 mean the residuals are positively autocorrelated. This is a bad result, consistent with the lag 1 autocorrelation of 0.9958624 shown in the same output. We should, therefore, evaluate alternative formulations for our model. We could, for instance, look at the company_revenues attribute.
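If you want to see where the D-W number comes from, the statistic is simply the sum of squared differences between consecutive residuals divided by the sum of squared residuals, and it is roughly equal to 2 * (1 - r), where r is the lag 1 autocorrelation. With our output, 2 * (1 - 0.9958624) is about 0.0083, close to the reported 0.0078. The following base R sketch illustrates this on simulated data; dw_statistic(), lag1_autocorrelation(), and all the variables are hypothetical helpers introduced only for this example:

```r
# Durbin-Watson statistic computed from a vector of residuals
dw_statistic <- function(e) sum(diff(e)^2) / sum(e^2)

# Lag 1 autocorrelation of the residuals
lag1_autocorrelation <- function(e) sum(e[-1] * e[-length(e)]) / sum(e^2)

set.seed(123)
n <- 500
x <- rnorm(n)

# Hypothetical error terms: independent noise versus an AR(1) process
errors_independent <- rnorm(n)
errors_ar1 <- as.numeric(arima.sim(model = list(ar = 0.95), n = n))

y_independent <- 1 + 2 * x + errors_independent
y_ar1 <- 1 + 2 * x + errors_ar1

res_independent <- residuals(lm(y_independent ~ x))
res_ar1 <- residuals(lm(y_ar1 ~ x))

# Expect a value near 2 for independent errors and near 0 for AR(1) errors
dw_independent <- dw_statistic(res_independent)
dw_ar1 <- dw_statistic(res_ar1)

# The approximation 2 * (1 - r) for the autocorrelated case
dw_ar1_approx <- 2 * (1 - lag1_autocorrelation(res_ar1))

cat("D-W, independent errors:", dw_independent, "\n")
cat("D-W, AR(1) errors:", dw_ar1, "approximation:", dw_ar1_approx, "\n")
```

Note that the statistic depends on the order of the observations, so it is meaningful only when the rows of the data have a natural ordering, such as time.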

As we did for the economic sector attribute, fit the model by employing the lm() function:

linear_regression_revenues <- lm(as.numeric(default_flag) ~ company_revenues, data = clean_casted_stored_data_validated_complete)

And then perform our diagnostic tests:

ncvTest(linear_regression_revenues)
durbinWatsonTest(linear_regression_revenues)

This results in:

Non-constant Variance Score Test
 Variance formula: ~ fitted.values
 Chisquare = 19.83106 Df = 1 p = 8.459667e-06

And:

lag Autocorrelation D-W Statistic p-value
 1 0.995317 0.008891022 0
 Alternative hypothesis: rho != 0

It seems we have a bad output here too, don't we? Both p-values are far below the 0.05 threshold, so neither the constant variance assumption nor the absence of autocorrelation holds for this model.

To be honest, this was just to show you how these models work and how to estimate them in R. We are now going to get more serious, considering all the variables we have at our disposal and how they interact with each other. That is to say, we are going to perform multiple linear regression.

Before getting into this, let us invest some time in learning how to visualize results from our estimated model with ggplot2. We are going to employ this for other models as well, and this will therefore not be wasted time.
