Classification tree

For the classification problem, we will prepare the breast cancer data in the same fashion as we did in Chapter 3, Logistic Regression and Discriminant Analysis. After loading the data, we will delete the patient ID, rename the features, eliminate the few observations with missing values, and then create the train/test datasets in the following way:

  > library(MASS) #contains the biopsy data, if not already loaded
> library(rpart) #recursive partitioning trees
> library(partykit) #for plotting trees with as.party()

> data(biopsy)
> biopsy <- biopsy[, -1] #delete ID
> names(biopsy) <- c("thick", "u.size", "u.shape", "adhsn", "s.size",
    "nucl", "chrom", "n.nuc", "mit", "class") #change the feature names

> biopsy.v2 <- na.omit(biopsy) #delete the observations with missing values

> set.seed(123) #random number generator
> ind <- sample(2, nrow(biopsy.v2), replace = TRUE, prob = c(0.7, 0.3))

> biop.train <- biopsy.v2[ind == 1, ] #the training data set
> biop.test <- biopsy.v2[ind == 2, ] #the test data set
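
A quick sanity check confirms that the sampling produced roughly the intended 70/30 split; a minimal sketch:

  > prop.table(table(ind)) #proportions should be close to 0.7 and 0.3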

With the data set up appropriately, we will use the same syntax style for a classification problem as we did previously for a regression problem, but before creating a classification tree, we need to confirm that the outcome is a factor, which we can verify with the str() function:

  > str(biop.test[, 10])
Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 2 1 1 ...
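
If the outcome were stored as a numeric or character vector instead, rpart() might not build a classification tree (it chooses its method from the outcome type), so it would need to be coerced first; a minimal sketch (unnecessary here, since class is already a factor):

  > biop.train$class <- as.factor(biop.train$class)
> biop.test$class <- as.factor(biop.test$class)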

First, create the tree and then examine the table for the optimal number of splits:

  > set.seed(123)
> tree.biop <- rpart(class ~ ., data = biop.train)
> tree.biop$cptable
          CP nsplit rel error    xerror       xstd
1 0.79651163      0 1.0000000 1.0000000 0.06086254
2 0.07558140      1 0.2034884 0.2674419 0.03746996
3 0.01162791      2 0.1279070 0.1453488 0.02829278
4 0.01000000      3 0.1162791 0.1744186 0.03082013
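
The rpart package also provides the plotcp() function, which plots the cross-validated error against the complexity parameter and makes the table above easier to read at a glance:

  > plotcp(tree.biop) #visual aid for choosing the pruning threshold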

The cross-validated error (the xerror column) is at a minimum with only two splits (row 3). We can now prune the tree back to that size, plot the pruned tree, and see how it performs on the test set:

  > cp <- tree.biop$cptable[3, "CP"] #extract the CP value for two splits
> prune.tree.biop <- prune(tree.biop, cp = cp)
> plot(as.party(prune.tree.biop))
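
Rather than hard-coding row 3, you can also select the tree size with the lowest cross-validated error programmatically, which generalizes to any cptable; a minimal sketch (best.row is just an illustrative name):

  > best.row <- which.min(tree.biop$cptable[, "xerror"])
> cp <- tree.biop$cptable[best.row, "CP"]
> prune.tree.biop <- prune(tree.biop, cp = cp)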

The output of the preceding command is the pruned tree plot. An examination of it shows that the uniformity of the cell size (u.size) is the first split, followed by nuclei (nucl). The full tree had an additional split at the cell thickness. We can predict the test observations by specifying type = "class" in the predict() function, as follows:

  > rparty.test <- predict(prune.tree.biop, newdata = biop.test,
    type = "class")

> table(rparty.test, biop.test$class)
rparty.test benign malignant
  benign       136         3
  malignant      6        64

> (136+64)/209
[1] 0.9569378
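
Rather than transcribing the counts from the table, the same accuracy can be computed directly from the predictions; a minimal sketch:

  > mean(rparty.test == biop.test$class) #proportion correctly classified
[1] 0.9569378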

The basic tree with just two splits achieves almost 96 percent accuracy. This still falls short of the 97.6 percent we achieved with logistic regression, but it should encourage us to believe that we can improve on this with the upcoming methods, starting with random forests.
