Discriminant analysis application

LDA is performed with the lda() function in the MASS package, which we have already loaded in order to access the biopsy data. The syntax is very similar to that of the lm() and glm() functions.
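If you are dropping into this section without the earlier data preparation, the following is a minimal sketch of the kind of setup the code below assumes. The feature names come from the model output later in this section; the split proportions, the seed, and the InformationValue package as the source of misClassError() and confusionMatrix() are assumptions on my part, not necessarily the author's exact steps:

    # Minimal setup sketch (assumptions noted in the comments)
    library(MASS)                 # lda(), qda(), and the biopsy data
    library(InformationValue)     # assumed source of misClassError()/confusionMatrix()

    data(biopsy)
    biopsy$ID <- NULL             # drop the sample ID column
    names(biopsy) <- c("thick", "u.size", "u.shape", "adhsn", "s.size",
                       "nucl", "chrom", "n.nuc", "mit", "class")
    biopsy <- na.omit(biopsy)     # remove the rows with missing values

    set.seed(123)                 # illustrative seed, not necessarily the book's
    ind    <- sample(2, nrow(biopsy), replace = TRUE, prob = c(0.7, 0.3))
    train  <- biopsy[ind == 1, ]
    test   <- biopsy[ind == 2, ]
    trainY <- ifelse(train$class == "malignant", 1, 0)
    testY  <- ifelse(test$class == "malignant", 1, 0)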

We can now begin fitting our LDA model, which is as follows:

    > lda.fit <- lda(class ~ ., data = train)
    > lda.fit
    Call:
    lda(class ~ ., data = train)

    Prior probabilities of groups:
       benign malignant
    0.6371308 0.3628692

    Group means:
               thick  u.size u.shape   adhsn  s.size    nucl   chrom   n.nuc     mit
    benign    2.9205 1.30463 1.41390 1.32450 2.11589 1.39735 2.08278 1.22516 1.09271
    malignant 7.1918 6.69767 6.68604 5.66860 5.50000 7.67441 5.95930 5.90697 2.63953

    Coefficients of linear discriminants:
                    LD1
    thick    0.19557291
    u.size   0.10555201
    u.shape  0.06327200
    adhsn    0.04752757
    s.size   0.10678521
    nucl     0.26196145
    chrom    0.08102965
    n.nuc    0.11691054
    mit     -0.01665454

This output shows us that the Prior probabilities of groups are approximately 64 percent for benign and 36 percent for malignant. Next is Group means, which is the average of each feature by class. The Coefficients of linear discriminants are the standardized linear combination of the features used to compute an observation's discriminant score. The higher the score, the more likely the observation is malignant.
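To make the idea of a discriminant score concrete, here is a short sketch that rebuilds LD1 by hand and compares it with what predict() returns. It assumes the lda.fit and train objects from above; centering at the prior-weighted grand mean mirrors how MASS constructs the scores:

    # Rebuild LD1 as a linear combination of the centered features
    X   <- as.matrix(train[, rownames(lda.fit$scaling)])   # features in model order
    ctr <- colSums(lda.fit$prior * lda.fit$means)           # prior-weighted grand mean
    ld1.manual <- scale(X, center = ctr, scale = FALSE) %*% lda.fit$scaling

    # The manual scores should match the x element returned by predict()
    head(cbind(manual = as.vector(ld1.manual),
               from.predict = as.vector(predict(lda.fit)$x)))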

The plot() function applied to an LDA fit will provide us with a histogram and/or the densities of the discriminant scores, as follows:

    > plot(lda.fit, type = "both")

The following is the output of the preceding command:

We can see that there is some overlap in the groups, indicating that there will be some incorrectly classified observations.
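If you prefer a separate panel for each class rather than the combined plot, MASS also provides ldahist(), which draws the distribution of the scores per group. A brief sketch, reusing the fitted model and training data from above:

    # One panel of discriminant scores per class; type can be "histogram",
    # "density", or "both", as with plot()
    ldahist(data = predict(lda.fit)$x[, 1], g = train$class, type = "both")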

The predict() function available with LDA provides a list of three elements: class, posterior, and x. The class element is the predicted class (benign or malignant), posterior is a matrix of probabilities of each observation belonging to each class, and x is the linear discriminant score. Let's just extract the probability of an observation being malignant:

    > train.lda.probs <- predict(lda.fit)$posterior[, 2]
    > misClassError(trainY, train.lda.probs)
    [1] 0.0401
    > confusionMatrix(trainY, train.lda.probs)
        0   1
    0 296  13
    1   6 159
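As a sanity check on the helper's output, the same training confusion matrix can be rebuilt with base R from the class element of predict(); the margins of this table may be transposed relative to the output above, and the classes are labeled benign/malignant rather than 0/1:

    # Cross-tabulate predicted versus actual classes on the training data
    table(predicted = predict(lda.fit)$class, actual = train$class)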

Well, unfortunately, it appears that our LDA model has performed much worse than the logistic regression models. The more important question is how it will perform on the test data:

    > test.lda.probs <- predict(lda.fit, newdata = test)$posterior[, 2]
    > misClassError(testY, test.lda.probs)
    [1] 0.0383
    > confusionMatrix(testY, test.lda.probs)
        0  1
    0 140  6
    1   2 61

That's actually not as bad as I thought, given the weaker performance on the training data. From a correctly classified perspective, it still did not perform as well as logistic regression (96 percent versus almost 98 percent).
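The 96 percent figure follows directly from the test confusion matrix above:

    # Correctly classified proportion on the test set
    (140 + 61) / (140 + 6 + 2 + 61)   # 0.9617, which matches 1 - 0.0383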

We will now move on to fit a QDA model. In R, QDA is also part of the MASS package and the function is qda(). Building the model is rather straightforward again, and we will store it in an object called qda.fit, as follows:

    > qda.fit <- qda(class ~ ., data = train)
    > qda.fit
    Call:
    qda(class ~ ., data = train)

    Prior probabilities of groups:
       benign malignant
    0.6371308 0.3628692

    Group means:
               thick u.size u.shape  adhsn s.size   nucl  chrom  n.nuc      mit
    benign    2.9205 1.3046  1.4139 1.3245 2.1158 1.3973 2.0827 1.2251 1.092715
    malignant 7.1918 6.6976  6.6860 5.6686 5.5000 7.6744 5.9593 5.9069 2.639535

As with LDA, the output includes the Group means, but it does not include coefficients because QDA is a quadratic function, as discussed previously, so there is no single linear combination of the features to report.
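One way to see this in the fitted object itself: instead of a single vector of linear coefficients, qda() stores a separate transformation for each class, reflecting the class-specific covariance structure behind the quadratic boundary. A quick sketch:

    # QDA keeps a covariance structure per class rather than one coefficient
    # vector: 'scaling' holds a p x p transformation for each group
    dim(qda.fit$scaling)   # 9 x 9 x 2: nine features, two classes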

The predictions for the train and test data follow the same flow of code as with LDA:

    > train.qda.probs <- predict(qda.fit)$posterior[, 2]
    > misClassError(trainY, train.qda.probs)
    [1] 0.0422
    > confusionMatrix(trainY, train.qda.probs)
        0   1
    0 287   5
    1  15 167
    > test.qda.probs <- predict(qda.fit, newdata = test)$posterior[, 2]
    > misClassError(testY, test.qda.probs)
    [1] 0.0526
    > confusionMatrix(testY, test.qda.probs)
        0  1
    0 132  1
    1  10 66

From the confusion matrices, we can quickly tell that QDA has performed the worst on the training data, and it has also classified the test set poorly, with 11 incorrect predictions. In particular, it has a high rate of false positives.
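The 11 errors, and where they fall, can be confirmed by tabulating the predicted classes directly; the false positives are the benign cases predicted as malignant:

    # Cross-tabulate QDA's predicted classes against the truth on the test set
    qda.test.class <- predict(qda.fit, newdata = test)$class
    table(predicted = qda.test.class, actual = test$class)

The ten benign cases predicted as malignant in this table are the false positives referred to above.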
