Visualizing results

How would you visualize results coming from a logistic regression? We are dealing here with multiple explanatory variables against one response variable. Trying to visualize all of the dimensions involved (one for each x, plus one more for y) wouldn't be feasible.

One thing we can definitely do is choose one or two relevant variables and visualize how the observed and predicted values vary as those variables vary.

Since our main aim is always to define which types of customer are more likely to default, I would select, among the variables showing significant relationships in terms of p-values, the ones with the highest beta coefficients. Those are the variables for which a one-unit variation produces the greatest increase in the probability of defaulting.

Let's have a look again at our variables and coefficients:

Looking at coefficient significance, we can select the following three variables:

  • previous_default
  • ROS
  • company_revenues

Among those three, the ones associated with the highest coefficient values (taken in absolute terms) are:

  • previous_default
  • company_revenues

Just to let you familiarize yourself with the object resulting from a glm() call, let's observe the value of these two coefficients, isolating them from the whole vector of estimated coefficients.

We can do this directly from the logistic object, which is what the glm() call returned: a list of different elements resulting from the estimation activity. Among these elements, there is one named coefficients, which conveniently stores all of the estimated coefficients. It is a named vector, meaning that every number stored is paired with a name equal to the name of the corresponding variable.
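If named vectors are new to you, here is a tiny self-contained sketch of how they behave (the numbers below are only illustrative, not taken from a real fit):

```r
# a named vector behaves like a small lookup table:
# each stored value is paired with a name
coefs <- c("(Intercept)" = -1.2,
           "previous_default" = 0.57,
           "company_revenues" = -2e-06)

coefs["previous_default"]                          # select one element by name
coefs[c("previous_default", "company_revenues")]   # or several at once
```

Selecting by name returns a (shorter) named vector; wrap it in unname() if you want the bare number.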

To extract a coefficient, it is therefore sufficient to select the coefficients element within the logistic list and subset it by the name of the variable you are interested in:

logistic$coefficients['previous_default']

previous_default
0.5680765

logistic$coefficients['company_revenues']

company_revenues
-1.976669e-06

This number, taken in absolute value, is way lower than the previous 0.56. We can therefore conclude that the most relevant element when evaluating a customer, in order to predict how probable it is that they will repay their bill, is their previous default history. After that, the second most relevant characteristic is the company's size: the smaller the company, the more probable it is that it will have difficulty paying its bills.
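One caveat worth keeping in mind: raw coefficients depend on the measurement scale of each variable, so comparing them directly is only a rough guide (company_revenues is measured in currency units, so its per-unit coefficient is naturally tiny). A quick way to read logistic coefficients is to exponentiate them, turning log-odds into odds ratios. A minimal sketch, reusing the two values reported above:

```r
# glm() logistic-regression coefficients live on the log-odds scale;
# exponentiating converts them into odds ratios
beta_default  <- 0.5680765      # coefficient of previous_default, from above
beta_revenues <- -1.976669e-06  # coefficient of company_revenues, from above

exp(beta_default)   # odds of default multiply by ~1.76 when previous_default rises by 1
exp(beta_revenues)  # essentially 1 per single currency unit: a tiny per-unit effect
```

A fairer magnitude comparison would rescale company_revenues (for example, per million of revenue) or standardize the predictors before fitting.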

Let's try to actually visualize the relationship between company_revenues and the predicted probability of default.

First of all, we have to obtain a dataset showing our explanatory variable of interest together with the estimated probability of default. We can do this directly from the results of the glm() call, selecting and slicing the model data frame stored in the resulting logistic object, and binding it with the fitted.values vector stored within the same logistic object:

logistic$model %>% 
  select(company_revenues) %>% 
  cbind(probability = logistic$fitted.values, .) -> dataviz_dataset
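To see this slicing in action without the original dataset (which isn't reproduced here), the following self-contained sketch fits a toy logistic model on simulated data; the variable name x is purely illustrative:

```r
# toy logistic fit, just to show the structure of a glm() result object
set.seed(1)
x <- runif(200)
y <- rbinom(200, 1, plogis(2 - 4 * x))
toy <- glm(y ~ x, family = binomial)

# `model` holds the data frame used in the fit,
# `fitted.values` holds one predicted probability per observation
str(toy$model)
length(toy$fitted.values)

# bind them together, mirroring the dataviz_dataset pipeline above
toy_viz <- cbind(probability = toy$fitted.values, toy$model["x"])
head(toy_viz)
```

The resulting data frame has one row per observation, with the fitted probability alongside the predictor, which is exactly the shape ggplot() needs.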

Now that we have the dataset ready, visualization is just a matter of passing it to the ggplot() function, specifying company_revenues as x and probability as y:

dataviz_dataset %>% 
  ggplot(aes(x = company_revenues, y = probability)) +
  geom_point()

What can you see from the plot? You will surely notice that an increased level of company revenues is associated with an overall lower predicted probability. This is coherent with the negative sign of our coefficient. But why don't all the probabilities sit on the same level for a given level of company_revenues?

Because the other explanatory variables also vary across observations: two companies with the same company_revenues can show different values for the remaining predictors, and therefore different predicted probabilities.
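To isolate the "pure" effect of a single predictor, a common trick is to predict over a grid of its values while holding the other predictors fixed. The sketch below is self-contained, using simulated data whose variable names merely echo the ones in this chapter (the original dataset isn't reproduced here):

```r
set.seed(42)

# simulate a small dataset resembling the one discussed in the text
n <- 500
df <- data.frame(
  previous_default = rbinom(n, 1, 0.3),
  company_revenues = runif(n, 0, 2e06)
)
linpred    <- -0.5 + 0.57 * df$previous_default - 1.9e-06 * df$company_revenues
df$default <- rbinom(n, 1, plogis(linpred))

fit <- glm(default ~ previous_default + company_revenues,
           data = df, family = binomial)

# predict along a revenue grid, holding previous_default fixed at 0
grid <- data.frame(previous_default = 0,
                   company_revenues = seq(0, 2e06, length.out = 100))
grid$probability <- predict(fit, newdata = grid, type = "response")

head(grid)
```

Plotting grid$probability against grid$company_revenues yields a single smooth curve instead of a scattered cloud, because the other predictor no longer varies from point to point.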
