Model fitting

Let us define the dataset we are going to employ for our modeling activity. We will employ clean_casted_stored_data_validated_complete, first removing default_flag, which is just another encoding of our response variable, and customer_code, which is meaningless as an explanatory variable:

clean_casted_stored_data_validated_complete %>% 
  select(-default_flag) %>%
  select(-customer_code) -> training_data

And we are ready now to fit our model:

multiple_regression <- lm(as.numeric(default_numeric) ~ ., data = training_data)

You should have already noticed the small dot after the ~ token. It means that all the available explanatory variables will be fitted against the as.numeric(default_numeric) response variable.
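If you ever want that dot spelled out, you can build the equivalent formula explicitly. Here is a minimal sketch (the explanatory and long_formula names are just illustrative):

# every column of training_data except the response, joined with "+"
explanatory <- setdiff(names(training_data), "default_numeric")
long_formula <- as.formula(
  paste("as.numeric(default_numeric) ~",
        paste(explanatory, collapse = " + ")))
# fits the same model as the dot shorthand above
multiple_regression_long <- lm(long_formula, data = training_data)

With that clarified, and before looking at model assumption validation, we can take a walk through the summary output: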

summary(multiple_regression)

This is actually a rich output that tells us a lot about the model. First you find the formula you passed to the lm() function. Then you see the actual values of your model coefficients: starting from the top, the intercept, that is, the β0 term we were talking about some minutes ago, and then, for each term of your model, the value of the related β, the standard error of the parameter estimate, a t value, and the probability of finding that t value under the hypothesis of no relationship existing between the explanatory variable and the response variable (the so-called p-value). I am starting to think that we should take some time to talk about p-values and hypothesis testing (ehm ehm ... the author speaking here: you are going to talk about model hypothesis testing in Chapter 10, A Different Outlook to Problems with Classification Models), but for now we can just consider those little stars on the far right of the parameters as a quick way of evaluating the level of significance of the relationship between a given x and the y.

You find a proper legend for those symbols at the bottom of the output:

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The maximum level of significance is represented by the three stars, which correspond to a probability equal to, or really close to, zero of observing that behavior of the response variable y by chance, that is, with no actual relationship to the level of x.

What can you see there? Some variables, such as the corporation or the previous default, are extremely significant, while others show an extremely low level of significance. We are going to deal with this when performing stepwise regression, which will ensure that only the most significant variables are kept in our model.
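By the way, the whole coefficient table can be retrieved as a matrix, which makes this kind of screening reproducible. A minimal sketch, filtering on the conventional 0.05 threshold (the coefficient_table name is just illustrative):

# Estimate, Std. Error, t value, and Pr(>|t|) for every estimated term
coefficient_table <- coef(summary(multiple_regression))
# keep only the terms whose p-value falls below 0.05
coefficient_table[coefficient_table[, "Pr(>|t|)"] < 0.05, , drop = FALSE]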

Before going into model assumption validation, I think we should take a closer look at the warning message at the beginning of our output:

Coefficients: (2 not defined because of singularities)

What do you think this singularities term means? Let's try to guess which variables the warning message is referring to. You can easily spot them by looking at the NAs in the list of coefficients:

subsidiary                  NA         NA      NA       NA
customer_agreement          NA         NA      NA       NA
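If you would rather extract them programmatically than scan the printout, coef() returns NA for exactly those aliased terms. A quick sketch:

estimated_coefficients <- coef(multiple_regression)
# the names of the terms lm() could not estimate
names(estimated_coefficients)[is.na(estimated_coefficients)]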

Shouldn't we take a closer look at those variables?

Let's inspect them by passing them to unique() to get an idea of which values they are composed of:

clean_casted_stored_data_validated_complete$customer_agreement %>% unique()
[1] 1

clean_casted_stored_data_validated_complete$subsidiary %>% unique()
[1] 1

Oh, it seems we couldn't have obtained much from those two attributes, since each is composed of just one value. That is where the singularities come from: a column that never varies is perfectly collinear with the intercept term, so lm() cannot estimate a separate coefficient for it. We'd better remove them from our dataset before moving on:

clean_casted_stored_data_validated_complete %>% 
  select(-default_flag) %>%
  select(-customer_code) %>%
  select(-c(customer_agreement, subsidiary)) -> training_data
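Incidentally, if you want to hunt for this kind of degenerate column in one shot, rather than spotting the NAs after fitting, you can count the distinct values of every column. A minimal base R sketch:

# number of distinct values per column; a count of 1 flags a constant column
sapply(clean_casted_stored_data_validated_complete,
       function(column) length(unique(column)))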

Let us estimate our model again, employing the new dataset deprived of the two useless attributes:

multiple_regression_new <- lm(as.numeric(default_numeric) ~ ., data = training_data)

We are now ready to move on to assumption validation. Oh yes, you have also got those last three lines at the end of the output:

Residual standard error: 0.4206 on 11510 degrees of freedom
Multiple R-squared: 0.08393, Adjusted R-squared: 0.08297
F-statistic: 87.87 on 12 and 11510 DF, p-value: < 2.2e-16
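Incidentally, all of those figures live inside the summary object, in case you ever need them programmatically. A quick sketch (the model_summary name is just illustrative):

model_summary <- summary(multiple_regression_new)
model_summary$sigma          # residual standard error
model_summary$r.squared     # multiple R-squared
model_summary$adj.r.squared # adjusted R-squared
model_summary$fstatistic    # F-statistic and its degrees of freedom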

I told you we were going to talk about model performance shortly, didn't I? We will go through all those R-squared values and standard errors in a moment, just after verifying whether the assumptions are met. Why?

Because we do not need to worry about our model's performance if our model cannot be held as statistically valid.
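As a small preview of that validation work, base R already bundles the standard diagnostic plots for a fitted lm object; a minimal sketch:

# residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(multiple_regression_new)
par(mfrow = c(1, 1))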
