Breast cancer detection using darch

In this section, we will use the darch package, which provides deep architectures and Restricted Boltzmann Machines (RBM). The darch package is built on the code from G. E. Hinton and R. R. Salakhutdinov, originally available as MATLAB code for Deep Belief Nets (DBN). The package is used to generate neural networks with many layers (deep architectures) and to train them with the method introduced by those authors.
This method includes pre-training with the contrastive divergence method and fine-tuning with commonly known training algorithms such as backpropagation or conjugate gradients. Additionally, supervised fine-tuning can be enhanced with maxout and dropout, two recently developed techniques that improve fine-tuning for deep learning.

The basis of the example is classification from a set of inputs. To do this, we will use the BreastCancer dataset, which we already used in Chapter 5, Training and Visualizing a Neural Network in R (there loaded from a .csv file; here we will load it from the mlbench package). This data has been taken from the UCI Machine Learning Repository. The dataset is updated periodically as Dr. Wolberg reports new clinical cases. The data refers to breast cancer patients, with each tumor classified as benign or malignant on the basis of a set of ten independent variables.

To get the data, we draw on the large collection of data available in the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml.

Details of the data are as follows:

  • Number of instances: 699 (as of 15 July 1992)
  • Number of attributes: 10 plus the class attribute
  • Attribute information: The class attribute has been moved to the last column

The description of the attributes is shown here:

   #  Attribute                     Domain
   -- -----------------------------------------
   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class                         (2 for benign, 4 for malignant)

To understand the darch function, we will first train and verify it on an XOR gate. The darch function uses the input attributes and the output data to build the model, which can then be tested internally by darch itself. In this case, we achieve 0 percent error, that is, 100 percent accuracy.
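
What follows is a minimal sketch of this XOR experiment, modeled on the example that ships with the darch package; the variable names, the ten hidden neurons, and the 1,000 training epochs are illustrative choices, not prescribed settings:

library("darch")

# Inputs and targets of a two-input XOR gate
trainData <- matrix(c(0,0, 0,1, 1,0, 1,1), ncol = 2, byrow = TRUE)
trainTargets <- matrix(c(0, 1, 1, 0), nrow = 4)

# Two input neurons, one hidden layer of ten neurons, one output neuron
xorModel <- darch(trainData, trainTargets, layers = c(2, 10, 1),
                  darch.numEpochs = 1000)

# The trained network should reproduce the XOR truth table
predict(xorModel, newdata = trainData, type = "bin")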

Next, we use the breast cancer data to build the darch model and then check the accuracy:

#####################################################################
####Chapter 7 - Neural Networks with R #########
####Breast Cancer Detection using darch package #########
#####################################################################
library("mlbench")
library("darch")

data(BreastCancer)
summary(BreastCancer)

# Remove the 16 rows containing missing values
data_cleaned <- na.omit(BreastCancer)
summary(data_cleaned)

# Build and train the deep network: layers of 10, 10, and 1 neurons
model <- darch(Class ~ ., data_cleaned, layers = c(10, 10, 1),
               darch.numEpochs = 50, darch.stopClassErr = 0,
               retainData = TRUE)

# Plot the network error against the training epochs
plot(model)

# Classify the data and count the misclassifications
predictions <- predict(model, newdata = data_cleaned, type = "class")
cat(paste("Incorrect classifications:", sum(predictions != data_cleaned[,11])))
table(predictions, data_cleaned[,11])

# Detailed confusion matrix
library(gmodels)
CrossTable(x = data_cleaned$Class, y = predictions,
           prop.chisq = FALSE)

Let's now analyze the code line by line, explaining in detail all the features used to obtain the results:

library("mlbench")
library("darch")

The first two lines of the initial code are used to load the libraries needed to run the analysis.

Remember that, to install a library that is not present in the initial distribution of R, you must use the install.packages() function. This is the main function used to install packages. It takes a vector of names and a destination library, downloads the packages from the repositories, and installs them. This function should be run only once, not every time you run the code.
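
For this example, assuming the packages are available in your configured repositories, a single call is enough; run it once per R installation:

install.packages(c("mlbench", "darch", "gmodels"))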

The mlbench library contains a collection of artificial and real-world machine learning benchmark problems, including, for example, several datasets from the UCI repository.

The darch library is a package for deep architectures and RBM:

data(BreastCancer)

With this command, we load the dataset named BreastCancer which, as mentioned, is contained in the mlbench library. Let's now see what's inside:

summary(BreastCancer)

With this command, we see a brief summary by using the summary() function.

Remember, the summary() function is a generic function used to produce summaries of the results of various model-fitting functions. The function invokes particular methods that depend on the class of the first argument.

In this case, the function has been applied to a dataframe and the results are listed in the following figure:

The summary() function returns a set of statistics for each variable. In particular, it is useful to highlight the result for the Class variable, which contains the diagnosis of the tumor mass: 458 cases of the benign class and 241 cases of the malignant class were detected. Another feature to highlight is the Bare.nuclei variable, for which 16 missing values were detected.
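
These two figures can be verified directly in the console; here is a quick sketch using base R:

# Class distribution: 458 benign, 241 malignant
table(BreastCancer$Class)
# Number of missing values in the Bare.nuclei variable: 16
sum(is.na(BreastCancer$Bare.nuclei))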

To remove missing values, we can use the na.omit() function:

data_cleaned <- na.omit(BreastCancer) 
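
A quick sketch confirms what the cleaning did: the 16 incomplete rows are dropped, leaving 683 of the original 699 observations:

nrow(BreastCancer)    # 699 observations before cleaning
nrow(data_cleaned)    # 683 observations after removing rows with NAs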

Now we build and train the model. In the darch() call, layers = c(10, 10, 1) sets the sizes of the network layers (ten neurons in the input layer, ten in the hidden layer, and one output neuron), darch.numEpochs = 50 sets the number of fine-tuning epochs, darch.stopClassErr = 0 stops training as soon as the classification error reaches zero, and retainData = TRUE stores the training data in the model object:

model <- darch(Class ~ ., data_cleaned, layers = c(10, 10, 1),
               darch.numEpochs = 50, darch.stopClassErr = 0,
               retainData = TRUE)

To evaluate the model performance, we can plot the raw network error:

plot(model)

The plot of error versus epoch is shown in the following figure:

We get the minimum error at 34 epochs.

We finally have the network trained and ready for use; now we can use it to make our predictions:

predictions <- predict(model, newdata = data_cleaned, type = "class")

We used the entire dataset at our disposal to make our predictions with the model. All we have to do now is compare the model predictions with the actual classes available in the dataset:

cat(paste("Incorrect classifications:", sum(predictions != data_cleaned[,11])))

The results are shown as follows:

> cat(paste("Incorrect classifications:", sum(predictions != data_cleaned[,11])))
Incorrect classifications: 2

The results are really good: only two wrong classifications out of 683 observations! To better understand where the errors occurred, we build a confusion matrix:

table(predictions, data_cleaned[,11])

The results are shown here:

> table(predictions, data_cleaned[,11])

predictions benign malignant
  benign       443         1
  malignant      1       238

Despite its simple form, the matrix tells us that we made only two errors, equally distributed between the two values of the class. For more information on the confusion matrix, we can use the CrossTable() function contained in the gmodels package. As always, before loading the package, you need to install it:

library(gmodels)
CrossTable(x = data_cleaned$Class, y = predictions,
           prop.chisq = FALSE)

The confusion matrix obtained by using the CrossTable() function is shown in the following figure:

As anticipated by the classification results, our model made only two errors: one false positive (FP) and one false negative (FN). Now let's calculate the accuracy; as indicated in Chapter 2, Learning Processes in Neural Networks, it is given by the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Let's calculate the accuracy in the R environment:

> Accuracy = (443+238)/683
> Accuracy
[1] 0.9970717
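
As a cross-check, the same value can be obtained directly from the prediction vector; a one-line sketch in base R:

# Proportion of correct classifications: (443 + 238) / 683
mean(predictions == data_cleaned$Class)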

As mentioned before, the classifier has achieved excellent results.
