The logistic regression model

We've already discussed the theory behind logistic regression, so we can begin fitting our models. Base R provides the glm() function for fitting generalized linear models, a class of models that includes logistic regression. Its syntax is similar to that of the lm() function we used in the previous chapter. One important difference is that we must specify the family = binomial argument, which tells R to fit a logistic regression rather than another variety of generalized linear model. We will start by creating a model that includes all of the features on the train set and see how it performs on the test set, as follows:

    > full.fit <- glm(class ~ ., family = binomial, data = train)
    > summary(full.fit)

    Call:
    glm(formula = class ~ ., family = binomial, data = train)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -3.3397  -0.1387  -0.0716   0.0321   2.3559

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -9.4293     1.2273  -7.683 1.55e-14 ***
    thick         0.5252     0.1601   3.280 0.001039 **
    u.size       -0.1045     0.2446  -0.427 0.669165
    u.shape       0.2798     0.2526   1.108 0.268044
    adhsn         0.3086     0.1738   1.776 0.075722 .
    s.size        0.2866     0.2074   1.382 0.167021
    nucl          0.4057     0.1213   3.344 0.000826 ***
    chrom         0.2737     0.2174   1.259 0.208006
    n.nuc         0.2244     0.1373   1.635 0.102126
    mit           0.4296     0.3393   1.266 0.205402
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 620.989  on 473  degrees of freedom
    Residual deviance:  78.373  on 464  degrees of freedom
    AIC: 98.373

    Number of Fisher Scoring iterations: 8

The summary() function allows us to inspect the coefficients and their p-values. We can see that only two features have p-values less than 0.05 (thick and nucl). The 95 percent confidence intervals can be examined with the confint() function, as follows:

    > confint(full.fit)
                       2.5 %     97.5 %
    (Intercept) -12.23786660 -7.3421509
    thick         0.23250518  0.8712407
    u.size       -0.56108960  0.4212527
    u.shape      -0.24551513  0.7725505
    adhsn        -0.02257952  0.6760586
    s.size       -0.11769714  0.7024139
    nucl          0.17687420  0.6582354
    chrom        -0.13992177  0.7232904
    n.nuc        -0.03813490  0.5110293
    mit          -0.14099177  1.0142786

Note that the two significant features have confidence intervals that do not cross zero. The coefficients in a logistic regression cannot be interpreted as the change in Y resulting from a one-unit change in X. This is where the odds ratio can be quite helpful. The beta coefficients from the model can be converted to odds ratios by exponentiating them, exp(beta).

In order to produce the odds ratios in R, we will use the following exp(coef()) syntax:

    > exp(coef(full.fit))
     (Intercept)        thick       u.size      u.shape        adhsn
    8.033466e-05 1.690879e+00 9.007478e-01 1.322844e+00 1.361533e+00
          s.size         nucl        chrom        n.nuc          mit
    1.331940e+00 1.500309e+00 1.314783e+00 1.251551e+00 1.536709e+00

The interpretation of an odds ratio is the change in the outcome odds resulting from a one-unit change in the feature. If the value is greater than 1, it indicates that, as the feature increases, the odds of the outcome increase. Conversely, a value less than 1 would mean that, as the feature increases, the odds of the outcome decrease. In this example, all the features except u.size will increase the odds.
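As a quick check on this interpretation, we can reproduce the odds ratio for thick by hand from its log-odds coefficient; the value 0.5252 is copied from the summary output above:

```r
# Convert the log-odds coefficient for thick into an odds ratio.
# The beta value is taken from the summary(full.fit) output above.
beta.thick <- 0.5252
or.thick <- exp(beta.thick)
round(or.thick, 4)  # 1.6908, matching the exp(coef()) output to rounding
```

In words, each one-unit increase in thick multiplies the odds of malignancy by about 1.69, roughly a 69 percent increase in the odds.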

One of the issues pointed out during data exploration was the potential issue of multi-collinearity. It is possible to produce the VIF statistics that we did in linear regression with a logistic model in the following way:

    > library(car)
    > vif(full.fit)
     thick u.size u.shape  adhsn s.size   nucl  chrom  n.nuc    mit
    1.2352 3.2488  2.8303 1.3021 1.6356 1.3729 1.5234 1.3431 1.0597

None of the values are greater than the VIF rule of thumb statistic of five, so collinearity does not seem to be a problem. Feature selection will be the next task; but, for now, let's produce some code to look at how well this model does on both the train and test sets.

You will first have to create a vector of the predicted probabilities, as follows:

    > train.probs <- predict(full.fit, type = "response")
    > train.probs[1:5]  # inspect the first 5 predicted probabilities
    [1] 0.02052820 0.01087838 0.99992668 0.08987453 0.01379266

Next, we need to evaluate how well the model performed in training and then evaluate how it fits on the test set. A quick way to do this is to produce a confusion matrix. In later chapters, we will examine the version provided by the caret package. Here, we will use the version provided by the InformationValue package, which requires the outcome coded as 0s and 1s. The default cutoff by which the function selects either benign or malignant is 0.50, which is to say that any probability at or above 0.50 is classified as malignant:

    > trainY <- y[ind == 1]
    > testY <- y[ind == 2]
    > confusionMatrix(trainY, train.probs)
        0   1
    0 294   7
    1   8 165
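The 0.50 cutoff described above can be sketched with a few hypothetical probabilities (the probs values here are made up for illustration, not taken from the model):

```r
# Map predicted probabilities to class labels at the 0.50 cutoff:
# at or above 0.50 -> malignant (1), below -> benign (0).
probs <- c(0.02, 0.45, 0.50, 0.87)  # hypothetical predicted probabilities
pred.class <- ifelse(probs >= 0.50, 1, 0)
pred.class  # 0 0 1 1
```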

The rows signify the predictions, and the columns signify the actual values. The diagonal elements are the correct classifications. The top right value, 7, is the number of false negatives, and the bottom left value, 8, is the number of false positives. We can also take a look at the error rate, as follows:

    > misClassError(trainY, train.probs)
    [1] 0.0316
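The reported rate agrees with a hand computation from the confusion matrix above, where the misclassifications are the two off-diagonal counts:

```r
# Error rate = (false negatives + false positives) / total observations,
# using the counts from the training confusion matrix above.
train.err <- (7 + 8) / (294 + 7 + 8 + 165)
round(train.err, 4)  # 0.0316
```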

It seems we have done a fairly good job with only a 3.16% error rate on the training set. As we previously discussed, we must be able to accurately predict unseen data, in other words, our test set.

The method to create a confusion matrix for the test set is similar to how we did it for the training data:

    > test.probs <- predict(full.fit, newdata = test, type = "response")
    > misClassError(testY, test.probs)
    [1] 0.0239
    > confusionMatrix(testY, test.probs)
        0  1
    0 139  2
    1   3 65

It looks like we have done pretty well in creating a model with all the features. The roughly 98 percent prediction rate is quite impressive. However, we must still check whether there is room for improvement. Imagine that you or your loved one is a patient who has been diagnosed incorrectly. As previously mentioned, the implications can be quite dramatic. With this in mind, is there perhaps a better way to create a classification algorithm? Let's have a look!
