Modeling using logistic regression

Logistic regression is a type of regression model where the dependent or class variable is not continuous but categorical, just as in our case, where credit rating is the dependent variable with two classes. Logistic regression is usually regarded as a special case of the family of generalized linear models. The model works by estimating the relationship between the class variable and the independent feature variables in terms of probabilities, using the logistic or sigmoid function to map the linear combination of features to a probability between 0 and 1. Thus, logistic regression does not predict classes directly but the probability of the outcome. Since we are dealing with a binary classification problem, we will use binomial logistic regression.
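
To make the probability estimation concrete, here is a minimal sketch of the sigmoid function in R; glm() applies this inverse link internally when you later request predictions with type="response":

# the sigmoid (inverse logit) function maps any real number into (0, 1)
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}
# for a linear combination of features z = b0 + b1*x1 + ... + bn*xn,
# the model estimates P(credit.rating = 1) = sigmoid(z)
sigmoid(0)   # 0.5, the decision boundary
sigmoid(3)   # ~0.95, strongly in favor of class 1
sigmoid(-3)  # ~0.05, strongly in favor of class 0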

First, we will load the library dependencies and separate the test feature and class variables as follows:

library(caret) # model training and evaluation
library(ROCR) # model evaluation
source("performance_plot_utils.R") # plotting metric results
## separate feature and class variables
## (credit.rating is the first column of test.data)
test.feature.vars <- test.data[,-1]
test.class.var <- test.data[,1]

Now we will train the initial model with all the independent variables as follows:

> formula.init <- "credit.rating ~ ."
> formula.init <- as.formula(formula.init)
> lr.model <- glm(formula=formula.init, data=train.data, family="binomial")

We can view the model details using the summary(lr.model) command, which shows the various variables and their significance values. A part of this output is shown in the following snapshot:

[Figure: partial summary(lr.model) output showing coefficient estimates and significance codes]

You can see that the model automatically dummy codes the categorical variables, creating an indicator variable for each category level in that variable (minus a reference level). The variables with stars beside them have p-values < 0.05 (which we discussed in the previous chapter) and are therefore significant.
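
If you want to inspect the significant terms programmatically instead of scanning for stars, a small sketch like this works on the fitted model; the 0.05 cutoff mirrors the discussion above:

## extract the coefficient table from the model summary
coef.table <- summary(lr.model)$coefficients
## keep only the terms whose p-value (column "Pr(>|z|)") is below 0.05
coef.table[coef.table[, "Pr(>|z|)"] < 0.05, ]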

Next, we perform predictions on the test data and evaluate the results as follows. Note that confusionMatrix() expects both of its arguments to be factors with the same levels, so we convert the rounded probabilities accordingly:

> lr.predictions <- predict(lr.model, test.data, type="response")
> lr.predictions <- round(lr.predictions)  # implicit 0.5 cutoff
> confusionMatrix(data=factor(lr.predictions, levels=c(0, 1)),
                  reference=factor(test.class.var, levels=c(0, 1)),
                  positive='1')

On running this, we get a confusion matrix with the associated metrics we discussed earlier, as shown in the following figure. It is quite interesting to see that we achieved an overall accuracy of 71.75%, which is quite decent considering this dataset has a majority of good credit rating customers. Sensitivity is 83%, which is quite good, and the specificity of 48% shows that the model identifies just under half of the bad credit ratings correctly. NPV is 58% and PPV is 76%.

[Figure: confusion matrix and evaluation metrics for the initial logistic regression model]
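
As a quick refresher on how these metrics relate to the confusion matrix cells, the following sketch computes them from raw counts; the helper name is illustrative, not part of caret:

## illustrative helper: metrics from confusion matrix counts,
## with class 1 (good rating) treated as the positive class
classification.metrics <- function(TP, TN, FP, FN) {
  list(
    accuracy    = (TP + TN) / (TP + TN + FP + FN),
    sensitivity = TP / (TP + FN),  # true positive rate (recall)
    specificity = TN / (TN + FP),  # true negative rate
    ppv         = TP / (TP + FP),  # positive predictive value (precision)
    npv         = TN / (TN + FN)   # negative predictive value
  )
}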

We will now try to build another model with some selected features and see how it performs. If you remember, we obtained some generic features that are important for classification in the earlier section on feature selection. We will still run feature selection specifically for logistic regression to see feature importance, using the following code snippet:

formula <- "credit.rating ~ ."
formula <- as.formula(formula)
control <- trainControl(method="repeatedcv", number=10, repeats=2)
model <- train(formula, data=train.data, method="glm", 
               trControl=control)
importance <- varImp(model, scale=FALSE)
plot(importance)

We get the following plot, from which we select the top five variables to build the next model. Reading the plot is pretty simple: the greater the importance score, the more important the variable. Feel free to add more variables and build different models using them!

[Figure: variable importance plot for the logistic regression model]
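
If you prefer to pick the top variables programmatically rather than reading them off the plot, a sketch like this works on the varImp() result; note that the names are the dummy-coded model terms rather than the raw variable names:

## sort the importance scores stored by caret and take the top five
imp.df <- importance$importance
head(rownames(imp.df)[order(imp.df$Overall, decreasing=TRUE)], 5)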

Next, we build the model using a similar approach as before and evaluate its performance on the test data using the following code snippet:

> formula.new <- "credit.rating ~ account.balance + credit.purpose 
                      + previous.credit.payment.status + savings 
                      + credit.duration.months"
> formula.new <- as.formula(formula.new)
> lr.model.new <- glm(formula=formula.new, data=train.data, family="binomial")
> lr.predictions.new <- predict(lr.model.new, test.data, type="response") 
> lr.predictions.new <- round(lr.predictions.new)
> confusionMatrix(data=factor(lr.predictions.new, levels=c(0, 1)),
                  reference=factor(test.class.var, levels=c(0, 1)),
                  positive='1')

We get the following confusion matrix and evaluation results. Accuracy has slightly increased to 72.25%. Sensitivity has shot up to 94%, which is excellent, but sadly this has happened at the cost of specificity, which has gone down to 27%: more bad credit ratings are now being predicted as good, 95 out of the total 130 bad credit rating customers in the test data! NPV has gone up to 69% because, with the higher sensitivity, fewer good credit ratings are being misclassified as false negatives.

[Figure: confusion matrix and evaluation metrics for the logistic regression model with selected features]

Now comes the question of which model to select for predictions. The choice does not depend solely on accuracy but on the domain and business requirements of the problem. If we predict a customer with a bad credit rating (0) as good (1), we will approve a credit loan for a customer who will end up not paying it back, causing losses for the bank. However, if we predict a customer with a good credit rating (1) as bad (0), we will deny them the loan, in which case the bank will neither profit nor incur any losses. This is much better than incurring huge losses by wrongly predicting bad credit ratings as good.
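
Since round() implicitly uses a probability cutoff of 0.5, one way to act on this cost asymmetry without training a new model is to require stronger evidence before labeling a customer as good. The following is a minimal sketch; the 0.7 cutoff is an illustrative value and should really be tuned on validation data:

## re-predict probabilities and apply a stricter cutoff for class 1
probs <- predict(lr.model, test.data, type="response")
cutoff <- 0.7  # illustrative value, not tuned here
strict.predictions <- ifelse(probs >= cutoff, 1, 0)
confusionMatrix(data=factor(strict.predictions, levels=c(0, 1)),
                reference=factor(test.class.var, levels=c(0, 1)),
                positive='1')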

Therefore, we choose our first model as the best one, and we will now view some metric evaluation plots using the following code snippet:

> lr.model.best <- lr.model
> lr.prediction.values <- predict(lr.model.best, test.feature.vars, type="response")
> predictions <- prediction(lr.prediction.values, test.class.var)
> par(mfrow=c(1,2))
> plot.roc.curve(predictions, title.text="LR ROC Curve")
> plot.pr.curve(predictions, title.text="LR Precision/Recall Curve")

We get the following plots from the preceding code:

[Figure: ROC curve and precision/recall curve for the logistic regression model]
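
If you want the AUC as a number rather than reading it off the plot, ROCR can compute it directly from the same prediction object:

## compute the area under the ROC curve
auc <- performance(predictions, measure="auc")
auc@y.values[[1]]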

The AUC is 0.74, which is pretty good for a start. We will now build the next predictive model using support vector machines, following a similar process, and see how it fares.
