To study boosting in action, in this section we'll introduce a new prediction problem from the field of atmospheric physics. More specifically, we will analyze the patterns made by radiation on a telescope camera in order to predict whether a particular pattern came from gamma rays entering the atmosphere or from regular background radiation.
Gamma rays leave distinctive elliptical patterns and so we can create a set of features to describe these. The data set we will use is the MAGIC Gamma Telescope data set, hosted by the UCI Machine Learning repository at http://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. Our data consists of 19,020 observations of the following attributes:
| Column name | Type | Definition |
| --- | --- | --- |
| FLENGTH | Numerical | The major axis of the ellipse (mm) |
| FWIDTH | Numerical | The minor axis of the ellipse (mm) |
| FSIZE | Numerical | Logarithm to the base ten of the sum of the content of all pixels in the camera photo |
| FCONC | Numerical | Ratio of the sum of the two highest pixels over FSIZE |
| FCONC1 | Numerical | Ratio of the highest pixel over FSIZE |
| FASYM | Numerical | Distance from the highest pixel to the center, projected onto the major axis (mm) |
| FM3LONG | Numerical | Third root of the third moment along the major axis (mm) |
| FM3TRANS | Numerical | Third root of the third moment along the minor axis (mm) |
| FALPHA | Numerical | Angle of the major axis with the vector to the origin (degrees) |
| FDIST | Numerical | Distance from the origin to the center of the ellipse (mm) |
| CLASS | Binary | Gamma rays (g) or background hadron radiation (b) |
First, we will load the data into a data frame called magic, recoding the CLASS output variable to use classes 1 and -1 for gamma rays and background radiation, respectively:
> magic <- read.csv("magic04.data", header = FALSE)
> names(magic) <- c("FLENGTH", "FWIDTH", "FSIZE", "FCONC", "FCONC1",
                    "FASYM", "FM3LONG", "FM3TRANS", "FALPHA", "FDIST",
                    "CLASS")
> magic$CLASS <- as.factor(ifelse(magic$CLASS == 'g', 1, -1))
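As a quick sanity check, the behavior of the ifelse() recoding can be seen on a toy character vector (the values below are hypothetical, not taken from the data set):

```r
# Toy illustration of the class recoding step (hypothetical values)
raw_class <- c("g", "b", "g", "g", "b")
recoded <- as.factor(ifelse(raw_class == "g", 1, -1))
recoded

# Factor levels are sorted as character strings, so they appear as "-1", "1"
levels(recoded)
# [1] "-1" "1"
```

Note that the resulting factor stores its classes as the character levels "-1" and "1", which is why later comparisons against numeric predictions rely on R's coercion rules.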
Next, we'll split our data frame into training and test data frames using our typical 80-20 split:
> library(caret)
> set.seed(33711209)
> magic_sampling_vector <- createDataPartition(magic$CLASS, p = 0.80, list = FALSE)
> magic_train <- magic[magic_sampling_vector, 1:10]
> magic_train_output <- magic[magic_sampling_vector, 11]
> magic_test <- magic[-magic_sampling_vector, 1:10]
> magic_test_output <- magic[-magic_sampling_vector, 11]
The model that we are going to use with boosting is a simple multilayer perceptron with a single hidden layer. Harkening back to Chapter 4, Neural Networks, we know that the nnet package is perfect for this task. With neural networks, we often get superior accuracy when we normalize the inputs, so before training any models, we will carry out this preprocessing step:
> magic_pp <- preProcess(magic_train, method = c("center", "scale"))
> magic_train_pp <- predict(magic_pp, magic_train)
> magic_train_df_pp <- cbind(magic_train_pp, CLASS = magic_train_output)
> magic_test_pp <- predict(magic_pp, magic_test)
Boosting is designed to work best with weak learners, and for this reason, we are going to use a very small number of hidden neurons in our hidden layer. Concretely, we will begin with the simplest possible multilayer perceptron that uses a single hidden neuron. To understand the effect of using boosting, we will establish our baseline performance by training a single neural network and measuring its accuracy. We can do this as follows:
> library(nnet)
> n_model <- nnet(CLASS ~ ., data = magic_train_df_pp, size = 1)
> n_test_predictions <- predict(n_model, magic_test_pp, type = "class")
> (n_test_accuracy <- mean(n_test_predictions == magic_test_output))
[1] 0.7948988
This establishes a baseline accuracy of around 79.5 percent. This isn't too bad, but we are going to use boosting to see if we can improve it. To that end, we will write our own function, AdaBoostNN(), which will take as input a data frame, the name of the output variable, the number of single-hidden-layer neural network models we want to build, and finally the number of hidden units these neural networks will have. The function then implements the AdaBoost algorithm that we previously described and returns a list of models and their weights. Here is the function:
AdaBoostNN <- function(training_data, output_column, M, hidden_units) {
  require("nnet")
  models <- list()
  alphas <- list()
  n <- nrow(training_data)
  model_formula <- as.formula(paste(output_column, '~ .', sep = ''))

  # Step 1: initialize uniform observation weights
  w <- rep((1 / n), n)

  for (m in 1:M) {
    # Train a weak learner using the current observation weights
    model <- nnet(model_formula, data = training_data,
                  size = hidden_units, weights = w)
    models[[m]] <- model

    predictions <- as.numeric(predict(
      model, training_data[, -which(names(training_data) == output_column)],
      type = "class"))
    errors <- predictions != training_data[, output_column]

    # Weighted error rate and the resulting model weight (alpha)
    error_rate <- sum(w * as.numeric(errors)) / sum(w)
    alpha <- 0.5 * log((1 - error_rate) / error_rate)
    alphas[[m]] <- alpha

    # Up-weight misclassified observations, down-weight the rest,
    # then renormalize so the weights sum to one
    temp_w <- mapply(function(x, y)
      if (y) { x * exp(alpha) } else { x * exp(-alpha) }, w, errors)
    w <- temp_w / sum(temp_w)
  }
  return(list(models = models, alphas = unlist(alphas)))
}
Before proceeding, we will work our way through the function to understand what each line is doing. We first initialize empty lists of models and model weights (alphas). We also compute the number of observations in our training data, storing this in the variable n. The name of the output column provided is then used to create a formula that describes the neural network we will build. In our data set, this formula will be CLASS ~ ., which means that the neural network will compute CLASS as a function of all the other columns as input features. We then initialize our weights vector, as we did in step 1 of AdaBoost, and define the loop that will run for M iterations in order to build M models.
In every iteration, the first step is to use the current setting of the weights vector to train a neural network with as many hidden units as specified in the input, hidden_units. We then compute a vector of predictions that this model generates on the training data using the predict() function. By comparing these predictions to the output column of the training data, we calculate the errors that the current model makes on the training data, which in turn allows us to compute the weighted error rate. Following step 4 of the AdaBoost algorithm, this error rate is used to compute alpha, the weight of the current model. Finally, the observation weights to be used in the next iteration of the loop are updated according to whether each observation was correctly classified, as per step 5 of the AdaBoost algorithm. The weight vector is then normalized and we are ready to begin the next iteration. After completing M iterations, we output a list of models and their corresponding model weights.
We now have a function that can train our ensemble classifier using AdaBoost, but we also need a function to make predictions. This function will take in the output list produced by our training function, AdaBoostNN(), along with a test data set. We've called this function AdaBoostNN.predict() and it is shown here:
AdaBoostNN.predict <- function(ada_model, test_data) {
  models <- ada_model$models
  alphas <- ada_model$alphas

  # One column of class predictions (1 or -1) per model
  prediction_matrix <- sapply(models, function(x)
    as.numeric(predict(x, test_data, type = "class")))

  # Scale each model's predictions by its model weight
  weighted_predictions <- t(apply(prediction_matrix, 1, function(x)
    mapply(function(y, z) y * z, x, alphas)))

  # The ensemble prediction is the sign of the weighted sum
  final_predictions <- apply(weighted_predictions, 1,
                             function(x) sign(sum(x)))
  return(final_predictions)
}
In this function, we first extract the models and the model weights from the list produced by our previous function. We then create a matrix of predictions, where each column corresponds to the vector of predictions made by a particular model. Thus, we will have as many columns in this matrix as models that we used for boosting.
We then multiply the predictions produced by each model with their corresponding model weight. For example, every prediction from the first model is in the first column of the prediction matrix and will have its value multiplied by the first model weight α1. Finally, in the last step, we reduce our matrix of weighted observations into a single vector of observations by summing the weighted predictions for each observation and taking the sign of the result. This vector of predictions is then returned by our function.
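The weighted voting step can be sketched in isolation with hypothetical predictions and model weights. Here three models classify four observations; rows are observations and columns are models, as in the prediction matrix above.

```r
# Toy illustration of the weighted vote (hypothetical numbers)
prediction_matrix <- matrix(c( 1,  1, -1,
                              -1, -1,  1,
                               1, -1, -1,
                               1,  1,  1), nrow = 4, byrow = TRUE)
alphas <- c(0.9, 0.5, 0.2)

# Scale each model's column of predictions by its weight, then take
# the sign of each row sum as the ensemble prediction
weighted <- sweep(prediction_matrix, 2, alphas, `*`)
final_predictions <- sign(rowSums(weighted))
final_predictions
# [1]  1 -1  1  1
```

Notice the third observation: the two weaker models vote -1, but the first model's larger weight (0.9 against 0.5 + 0.2) carries the vote for class 1.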
As an experiment, we will train ten neural network models with a single hidden unit and see if boosting improves accuracy:
> ada_model <- AdaBoostNN(magic_train_df_pp, 'CLASS', 10, 1)
> predictions <- AdaBoostNN.predict(ada_model, magic_test_pp)
> mean(predictions == magic_test_output)
[1] 0.804365
Boosting ten models seems to give us a marginal improvement in accuracy, but perhaps training more models might make more of a difference. We are also interested in the relationship between the complexity of our weak learner, as measured by the number of hidden neurons, and the performance benefits we can expect from boosting on this data set. The following plot shows the results of experimenting with our functions using different numbers of models as well as hidden neurons:
For the neural networks with one hidden unit, accuracy improves as the number of boosting models increases, but the improvement tapers off after 100 models and actually dips slightly at 200 models. Still, the improvement over the single-model baseline is substantial for these networks. When we increase the complexity of our weak learner to a hidden layer with three hidden neurons, boosting yields a much smaller improvement in performance. At 200 models, both ensembles perform at a similar level, indicating that at this point our accuracy is being limited by the type of model we are training.