Implementing dropout to avoid overfitting

Dropout is defined in the network architecture after the activation layers, and during training it randomly sets a fraction of the activations to zero. In effect, dropout temporarily removes parts of the neural network on each training pass, which helps prevent overfitting: the network cannot memorize the training data exactly when it must keep making predictions with randomly missing units, so it is forced to learn more robust representations that generalize better. At prediction time, no activations are dropped.
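To make this concrete, here is a minimal sketch in plain R (independent of MXNet) that applies dropout to a small vector of toy activations; the values of act and p are arbitrary, and MXNet's Dropout layer performs the equivalent masking and rescaling internally during training:

# toy activations coming out of a hidden layer
act <- c(0.8, 1.5, 0.2, 2.1, 0.9, 1.3)
p <- 0.5                      # probability of dropping each activation

set.seed(0)
# draw a random 0/1 mask: each unit survives with probability 1 - p
mask <- rbinom(length(act), size = 1, prob = 1 - p)

# zero out the dropped activations and rescale the survivors by 1/(1 - p)
# (the "inverted dropout" convention, so nothing needs to change at test time)
act_dropped <- act * mask / (1 - p)
print(act_dropped)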

In MXNet, dropout can easily be defined as part of the network architecture using the mx.symbol.Dropout function. For example, the following code adds a dropout layer after the first ReLU activation (act1) and another after the second ReLU activation (act2):

dropout1 <- mx.symbol.Dropout(data = act1, p = 0.5)
dropout2 <- mx.symbol.Dropout(data = act2, p = 0.3)

The data parameter specifies the input that the dropout layer takes, and p specifies the fraction of activations to drop. In the case of dropout1, we are specifying that 50% of the activations are to be randomly zeroed during training. Again, there is no hard-and-fast rule about how much dropout should be applied or at which layers; this is something to be determined through trial and error. The code with dropout remains almost identical to the earlier project, except that it now includes dropout layers after the activations:

# the code to read the dataset and transform it into train.x and train.y remains
# the same as in the earlier project, so it is not shown here
# including the required mxnet library
library(mxnet)
# defining the input layer in the network architecture
data <- mx.symbol.Variable("data")
# defining the first hidden layer with 128 neurons and also naming the layer as fc1
# passing the input data layer as input to the fc1 layer
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
# defining the ReLU activation function on the fc1 output and also naming the layer as ReLU1
act1 <- mx.symbol.Activation(fc1, name="ReLU1", act_type="relu")
# defining a dropout layer that randomly drops 50% of the activations
dropout1 <- mx.symbol.Dropout(data = act1, p = 0.5)
# defining the second hidden layer with 64 neurons and also naming the layer as fc2
# passing the previous dropout output as input to the fc2 layer
fc2 <- mx.symbol.FullyConnected(dropout1, name="fc2", num_hidden=64)
# defining the ReLU activation function on the fc2 output and also naming the layer as ReLU2
act2 <- mx.symbol.Activation(fc2, name="ReLU2", act_type="relu")
# defining a dropout layer that randomly drops 30% of the activations
dropout2 <- mx.symbol.Dropout(data = act2, p = 0.3)
# defining the third and final fully connected layer with 10 neurons (one per class) and naming it fc3
# passing the previous dropout output as input to the fc3 layer
fc3 <- mx.symbol.FullyConnected(dropout2, name="fc3", num_hidden=10)
# defining the output layer with the softmax activation function to obtain class probabilities
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")
# defining that the experiment should run on cpu
devices <- mx.cpu()
# setting the seed for the experiment so as to ensure that the results are reproducible
mx.set.seed(0)
# building the model with the network architecture defined above
model <- mx.model.FeedForward.create(softmax, X = train.x, y = train.y,
                                     ctx = devices, num.round = 50,
                                     array.batch.size = 100,
                                     array.layout = "rowmajor",
                                     learning.rate = 0.07, momentum = 0.9,
                                     eval.metric = mx.metric.accuracy,
                                     initializer = mx.init.uniform(0.07),
                                     epoch.end.callback = mx.callback.log.train.metric(100))
# making predictions on the test dataset
preds <- predict(model, test)
# verifying the predicted output
print(dim(preds))
# getting the label for each observation in the test dataset; the predicted class is the one with the highest probability
pred.label <- max.col(t(preds)) - 1
# observing the distribution of predicted labels in the test dataset
print(table(pred.label))
# including the rfUtilities library so as to use its accuracy function
library(rfUtilities)
# obtaining the performance of the model
print(accuracy(pred.label,test.y))
# printing the network architecture
graph.viz(model$symbol)

This produces the following output, along with a visualization of the network architecture:

[35] Train-accuracy=0.958950003186862
[36] Train-accuracy=0.958983335793018
[37] Train-accuracy=0.958083337446054
[38] Train-accuracy=0.959683336317539
[39] Train-accuracy=0.95990000406901
[40] Train-accuracy=0.959433337251345
[41] Train-accuracy=0.959066670437654
[42] Train-accuracy=0.960250004529953
[43] Train-accuracy=0.959983337720235
[44] Train-accuracy=0.960450003842513
[45] Train-accuracy=0.960150004227956
[46] Train-accuracy=0.960533337096373
[47] Train-accuracy=0.962033336758614
[48] Train-accuracy=0.96005000303189
[49] Train-accuracy=0.961366670827071
[50] Train-accuracy=0.961350003282229
[1] 10 10000
pred.label
0 1 2 3 4 5 6 7 8 9
984 1143 1042 1022 996 902 954 1042 936 979
Accuracy (PCC): 97.3%
Cohen's Kappa: 0.97
Users accuracy:
0 1 2 3 4 5 6 7 8 9
98.7 98.9 98.1 97.6 98.2 97.3 97.6 97.4 94.3 94.7
Producers accuracy:
0 1 2 3 4 5 6 7 8 9
98.3 98.3 97.1 96.5 96.8 96.2 98.0 96.1 98.1 97.7
Confusion matrix
y
x 0 1 2 3 4 5 6 7 8 9
0 967 0 0 0 0 2 5 1 6 3
1 0 1123 3 0 1 1 3 5 2 5
2 1 2 1012 4 3 0 0 14 4 2
3 2 1 4 986 0 6 1 3 12 7
4 0 0 3 0 964 2 5 0 5 17
5 2 3 0 9 0 868 7 0 9 4
6 3 2 0 0 5 3 935 0 6 0
7 4 1 9 4 3 3 0 1001 6 11
8 1 3 1 2 1 3 2 1 918 4
9 0 0 0 5 5 4 0 3 6 956

Take a look at the following diagram:

We can see from the output that dropout is now included as part of the network architecture. We also observe that this architecture yields a lower accuracy on the test dataset than our initial project. One reason could be that the dropout rates we used (50% and 30%) are too high. We could experiment with these rates and rebuild the model to see whether the accuracy improves, as sketched below. The point here, however, is to demonstrate the use of dropout as a regularization technique to avoid overfitting in deep neural networks.
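One simple way to run such an experiment is to wrap the network definition in a helper function that takes the two dropout rates as arguments and retrain with a few candidate settings. The build_model() helper below is written here purely for illustration (it is not part of the mxnet API) and assumes train.x, train.y, test, and test.y are already loaded as in the earlier project:

# helper that rebuilds the network with the given dropout rates and trains it
build_model <- function(p1, p2) {
  data <- mx.symbol.Variable("data")
  fc1 <- mx.symbol.FullyConnected(data, name = "fc1", num_hidden = 128)
  act1 <- mx.symbol.Activation(fc1, name = "ReLU1", act_type = "relu")
  drop1 <- mx.symbol.Dropout(data = act1, p = p1)
  fc2 <- mx.symbol.FullyConnected(drop1, name = "fc2", num_hidden = 64)
  act2 <- mx.symbol.Activation(fc2, name = "ReLU2", act_type = "relu")
  drop2 <- mx.symbol.Dropout(data = act2, p = p2)
  fc3 <- mx.symbol.FullyConnected(drop2, name = "fc3", num_hidden = 10)
  softmax <- mx.symbol.SoftmaxOutput(fc3, name = "sm")
  mx.set.seed(0)
  mx.model.FeedForward.create(softmax, X = train.x, y = train.y, ctx = mx.cpu(),
                              num.round = 50, array.batch.size = 100,
                              array.layout = "rowmajor", learning.rate = 0.07,
                              momentum = 0.9, eval.metric = mx.metric.accuracy,
                              initializer = mx.init.uniform(0.07))
}

# trying a couple of milder dropout settings and comparing test accuracy;
# each setting retrains from scratch, so the grid is kept deliberately small
for (rates in list(c(0.2, 0.1), c(0.3, 0.2))) {
  model <- build_model(rates[1], rates[2])
  preds <- predict(model, test)
  pred.label <- max.col(t(preds)) - 1
  cat("dropout", rates[1], "/", rates[2], "-> test accuracy:",
      mean(pred.label == test.y), "\n")
}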

Apart from dropout, there are other techniques you could employ to avoid overfitting:

  • Addition of data: Adding more training data.
  • Data augmentation: Creating additional data synthetically by applying techniques such as flipping, distorting, adding random noise, and rotation (a small plain-R sketch follows this list). The following screenshot shows sample images created after applying data augmentation:

Sample images from applying data augmentation
  • Reducing complexity of the network architecture: Fewer layers, fewer epochs, and so on.
  • Batch normalization: A process of ensuring that the values flowing through the network do not drift to extremes. For each mini-batch, a layer's activations are normalized by subtracting the mini-batch mean and dividing by the mini-batch standard deviation, followed by a learned scale and shift. It helps guard against overfitting, acts as a regularizer, and significantly improves training speed. In the R package, the mx.symbol.BatchNorm() function enables us to define batch normalization after the activation; a minimal sketch follows this list.
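As a rough illustration of data augmentation, the following plain-R sketch creates a noisy copy and a shifted copy of a single image; it assumes each row of train.x is a flattened 28 x 28 image with pixel values scaled to [0, 1], and it is only meant to show the idea, not a full augmentation pipeline:

# reshape one flattened training image into a 28 x 28 matrix
img <- matrix(train.x[1, ], nrow = 28, ncol = 28)

# random noise: add small Gaussian noise and clip back to the [0, 1] range
set.seed(0)
img_noisy <- pmin(pmax(img + matrix(rnorm(28 * 28, sd = 0.05), 28, 28), 0), 1)

# small translation: shift the image one pixel to the right, padding with zeros
img_shifted <- cbind(0, img[, -ncol(img)])

# the augmented images can then be appended to the training set as extra rows,
# keeping the label of the original image
train.x.aug <- rbind(train.x, as.vector(img_noisy), as.vector(img_shifted))
train.y.aug <- c(train.y, train.y[1], train.y[1])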
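And here is a minimal sketch of how a batch normalization layer could be inserted into the same symbolic network with mx.symbol.BatchNorm(); placing it after the first activation mirrors the description above, but it is only one reasonable configuration:

# input layer
data <- mx.symbol.Variable("data")
# first fully connected layer and its ReLU activation, as before
fc1 <- mx.symbol.FullyConnected(data, name = "fc1", num_hidden = 128)
act1 <- mx.symbol.Activation(fc1, name = "ReLU1", act_type = "relu")
# batch normalization applied to the activations of the first hidden layer
bn1 <- mx.symbol.BatchNorm(data = act1, name = "bn1")
# the rest of the network is wired up exactly as in the dropout example,
# except that bn1 (rather than a dropout layer) feeds the next layer
fc2 <- mx.symbol.FullyConnected(bn1, name = "fc2", num_hidden = 64)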

We will not develop another full project with batch normalization, as using this function is very similar to using the other functions from our earlier projects. So far, we have focused on increasing the number of epochs to improve the performance of the model; another option is to try a different architecture and evaluate whether that improves the accuracy on the test dataset. On that note, let's explore LeNet, which was specifically designed for optical character recognition in documents.
