In Chapter 3, Linear Regression, we analyzed the glass identification dataset, whose task is to identify the type of glass comprising a glass fragment found at a crime scene. The output of this dataset is a factor with several class levels corresponding to different types of glass. Our previous approach was to build a one-versus-all model using multinomial logistic regression. The results were not very promising, and one of the main points of concern was a poor model fit on the training data.
In this section, we will revisit this dataset and see whether a neural network model can do better. At the same time, we will demonstrate how neural networks can handle classification problems as well:
> glass <- read.csv("glass.data", header = FALSE)
> names(glass) <- c("id", "RI", "Na", "Mg", "Al", "Si", "K", "Ca",
                    "Ba", "Fe", "Type")
> glass$id <- NULL
Our output is a multiclass factor, and so we will want to dummy-encode it into binary columns. With the neuralnet package, we would normally need to do this manually as a preprocessing step before we can build our model. In this section, we will look at a second package that contains functions for building neural networks, nnet. This is actually the same package that we used for multinomial logistic regression. One of the benefits of this package is that, for multiclass classification, the nnet() function that trains the neural network will automatically detect outputs that are factors and perform the dummy encoding for us. With that in mind, we will prepare a training and test set:
> glass$Type <- factor(glass$Type)
> set.seed(4365677)
> glass_sampling_vector <- createDataPartition(glass$Type,
                                               p = 0.80, list = FALSE)
> glass_train <- glass[glass_sampling_vector,]
> glass_test <- glass[-glass_sampling_vector,]
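As an aside, the dummy encoding that nnet() performs internally can be reproduced manually with the class.ind() helper from the same package; this is essentially the preprocessing step we would need to carry out ourselves when using neuralnet. A minimal sketch on a toy factor:

```r
library(nnet)

# class.ind() converts a factor into a matrix of binary indicator
# columns, one column per class level
f <- factor(c("a", "b", "a", "c"))
class.ind(f)
#      a b c
# [1,] 1 0 0
# [2,] 0 1 0
# [3,] 1 0 0
# [4,] 0 0 1
```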
Next, just as with our previous dataset, we will normalize our input data:
> glass_pp <- preProcess(glass_train[1:9], method = c("range"))
> glass_train <- cbind(predict(glass_pp, glass_train[1:9]),
                       Type = glass_train$Type)
> glass_test <- cbind(predict(glass_pp, glass_test[1:9]),
                      Type = glass_test$Type)
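The "range" method of preProcess() performs min-max scaling, mapping each input column onto the [0, 1] interval using the minimum and maximum observed in the training data. A minimal sketch of the underlying transform on a toy vector:

```r
# min-max scaling: the transform preProcess(..., method = "range") applies
x <- c(2, 5, 9)
x_min <- min(x)  # in practice, computed on the training data only
x_max <- max(x)
scaled <- (x - x_min) / (x_max - x_min)
scaled
# [1] 0.0000000 0.4285714 1.0000000
```

Note that the test set is scaled with the training set's minimum and maximum, which is why we built glass_pp from glass_train and then applied it to both data frames.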
We are now ready to train our model. Whereas the neuralnet package is able to model multiple hidden layers, the nnet package is designed to model neural networks with a single hidden layer. As a result, we still specify a formula as before, but this time, instead of a hidden parameter that can be either a scalar or a vector of integers, we specify a size parameter, an integer representing the number of nodes in the single hidden layer of our model.
Also, the default neural network model in the nnet package is for classification, as the output layer uses a logistic activation function. When working with different packages for training the same type of model, such as multilayer perceptrons, it is really important to check the default values for the various model parameters, as these will differ from package to package. One other difference between the two packages that we will mention here is that nnet currently does not offer any plotting capabilities. Without further ado, we will now train our model:
> glass_model <- nnet(Type ~ ., data = glass_train, size = 10)
# weights:  166
initial  value 343.685179
iter  10 value 265.604188
iter  20 value 220.518320
iter  30 value 194.637078
iter  40 value 192.980203
iter  50 value 192.569751
iter  60 value 192.445198
iter  70 value 192.421655
iter  80 value 192.415382
iter  90 value 192.415166
iter 100 value 192.414794
final  value 192.414794
stopped after 100 iterations
From the output, we can see that the model has not converged, stopping after the default limit of 100 iterations. To reach convergence, we can either rerun this code a number of times or increase the number of allowed iterations to 1,000 using the maxit parameter:
> glass_model <- nnet(Type ~ ., data = glass_train, size = 10, maxit = 1000)
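Rather than reading the iteration trace, we can also check convergence programmatically: the object returned by nnet() stores a convergence flag, which, per the package documentation, is 1 if the maximum number of iterations was reached and 0 otherwise. A small self-contained illustration on the built-in iris data:

```r
library(nnet)

set.seed(1)
# a small model that should converge well before the iteration limit
m <- nnet(Species ~ ., data = iris, size = 2, maxit = 1000, trace = FALSE)
m$convergence  # 0 indicates convergence before maxit was reached
```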
Let's first investigate the accuracy of our model on the training data in order to assess the quality of fit. To compute predictions, we use the predict() function and specify the type parameter to be class. This lets the predict() function know that we want the class with the highest probability to be selected. If we want to see the probabilities of each class instead, we can specify the value raw for the type parameter. Finally, remember that we must pass in a data frame without the outputs to the predict() function, hence the need to subset the training data frame:
> train_predictions <- predict(glass_model, glass_train[,1:9],
                               type = "class")
> mean(train_predictions == glass_train$Type)
[1] 0.7183908046
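Accuracy alone hides which classes the model struggles with. Continuing the session above, a confusion table of predicted against actual glass types makes the per-class errors visible:

```r
# rows are predicted types, columns are actual types; off-diagonal
# counts are the misclassified training observations
table(predicted = train_predictions, actual = glass_train$Type)
```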
Our first attempt shows us that we are getting the same quality of fit as with our multinomial logistic regression model. To improve upon this, we'll increase the complexity of the model by adding more neurons to our hidden layer. We will also increase our maxit parameter to 10,000, as the more complex model might need more iterations to converge:
> glass_model2 <- nnet(Type ~ ., data = glass_train, size = 50,
                       maxit = 10000)
> train_predictions2 <- predict(glass_model2, glass_train[,1:9],
                                type = "class")
> mean(train_predictions2 == glass_train$Type)
[1] 1
As we can see, we have now achieved 100 percent training accuracy. Now that we have a decent model fit, we can investigate our performance on the test set:
> test_predictions2 <- predict(glass_model2, glass_test[,1:9],
                               type = "class")
> mean(test_predictions2 == glass_test$Type)
[1] 0.6
Even though our model fits the training data perfectly, we see that the accuracy on the test set is only 60 percent. Even factoring in that the dataset is very small, this discrepancy is a classic signal that our model is overfitting on the training data. When we looked at linear and logistic regression, we saw that there are shrinkage methods, such as the lasso, which are designed to combat overfitting by restricting the size of the coefficients in the model.
An analogous technique known as weight decay exists for neural networks. With this approach, the product of a decay constant and the sum of the squares of all the network weights is added to the cost function. This prevents any weights from taking overly large values and thus regularizes the network. Whereas there is currently no option for regularization with neuralnet(), nnet() provides it through the decay parameter:
> glass_model3 <- nnet(Type ~ ., data = glass_train, size = 10,
                       maxit = 10000, decay = 0.01)
> train_predictions3 <- predict(glass_model3, glass_train[,1:9],
                                type = "class")
> mean(train_predictions3 == glass_train$Type)
[1] 0.9367816092
> test_predictions3 <- predict(glass_model3, glass_test[,1:9],
                               type = "class")
> mean(test_predictions3 == glass_test$Type)
[1] 0.775
With this model, the fit on our training data is still very high, and substantially higher than we achieved with multinomial logistic regression. On the test set, the performance is still worse than on the training set, but much better than we had before.
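One way to see the regularization at work is to compare the overall size of the fitted weights, which the nnet object exposes through its wts component. Continuing the session, the sum of squared weights, the very quantity the decay term penalizes, should be markedly smaller for the regularized model:

```r
# sum of squared network weights: the quantity the decay penalty shrinks
sum(glass_model2$wts ^ 2)  # unregularized model (decay defaults to 0)
sum(glass_model3$wts ^ 2)  # regularized model (decay = 0.01)
```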
We won't spend any more time on the glass identification data. Instead, we will reflect on a few lessons learned before moving on. The first of these is that achieving good performance with a neural network, and sometimes even just reaching convergence, can be tricky. Training the model involves a random initialization of the network weights, and the final result is often quite sensitive to these starting conditions. We can convince ourselves of this fact by training the different model configurations we have seen so far a number of times: on some runs, certain configurations may fail to converge, and performance on the training and test sets tends to differ from one run to the next.
Another insight is that training a neural network involves tuning a diverse range of parameters, from the number and arrangement of hidden neurons to the value of the decay parameter. Others that we did not experiment with include the choice of nonlinear activation function for the hidden layer neurons, the criteria for convergence, and the particular cost function used to fit the model. For example, instead of using least squares, we could use a criterion known as entropy.
Before settling on a final choice of model, therefore, it pays to try out as many different combinations of these as possible. A good place to experiment with different parameter combinations is the train() function of the caret package. It provides a unified interface to both neural network packages we have seen and, in conjunction with expand.grid(), allows the simultaneous training and evaluation of several different neural network configurations. We'll provide just a vignette here; the interested reader can use this as a starting point for further investigation:
> library(caret)
> nnet_grid <- expand.grid(.decay = c(0.1, 0.01, 0.001, 0.0001),
                           .size = c(50, 100, 150, 200, 250))
> nnetfit <- train(Type ~ ., data = glass_train, method = "nnet",
                   maxit = 10000, tuneGrid = nnet_grid, trace = F,
                   MaxNWts = 10000)
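Once train() completes, the resulting caret object can be queried for the winning parameter combination and used directly for prediction. Continuing the session, for example:

```r
# best decay/size combination found during resampling
nnetfit$bestTune

# predict() on a train object uses the final model, refit on all
# the training data with the best parameter values
caret_predictions <- predict(nnetfit, glass_test[, 1:9])
mean(caret_predictions == glass_test$Type)
```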