Implementing a deep learning network for handwritten digit recognition

The mxnet library offers several functions that enable us to define the layers and activations that comprise the deep learning network. The definition of layers, the choice of activation functions, and the number of neurons in each hidden layer are collectively termed the network architecture. Deciding on the network architecture is more of an art than a science: there are no exact rules for finding the ideal architecture, so the number of layers, the number of neurons in each layer, and the type of layers are largely decided through trial and error, often over several iterations of experiments.

In this section, we'll build a simple deep learning network with three hidden layers. Here is the general architecture of our network:

  1. The input layer is defined as the initial layer in the network. The mx.symbol.Variable MXNet function defines the input layer.
  2. A fully-connected layer is defined, also called a dense layer, with 128 neurons as the first hidden layer in the network. This can be done using the mx.symbol.FullyConnected MXNet function.
  3. A ReLU activation function is defined as part of the network. The mx.symbol.Activation function helps us to define the ReLU activation function as part of the network.
  4. The second hidden layer is defined as another dense layer with 64 neurons. This can be accomplished through the mx.symbol.FullyConnected function, similar to the first hidden layer.
  5. A ReLU activation function is applied to the second hidden layer's output. This can be done through the mx.symbol.Activation function.
  6. The final hidden layer in our network is another fully-connected layer, but with just ten outputs (equal to the number of classes). This can be done through the mx.symbol.FullyConnected function as well.
  7. The output layer is defined last. It should produce a predicted probability for each class; therefore, we apply softmax at the output layer. The mx.symbol.SoftmaxOutput function enables us to configure the softmax in the output.

We are not saying that this is the best possible network architecture for the problem, but it is the network we are going to build to demonstrate the implementation of a deep learning network with MXNet.
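
To get a feel for the size of this architecture, the number of trainable parameters can be worked out by hand. The sketch below is plain R arithmetic rather than an MXNet call. It assumes 784 inputs (28 x 28 pixel MNIST images flattened into one row each) and one bias per neuron; the variable names are only illustrative.

```r
# Layer sizes: 784 input pixels (assumed 28 x 28 images), then 128, 64 and 10 units
layer_sizes <- c(784, 128, 64, 10)

n_in  <- head(layer_sizes, -1)   # number of inputs to each fully-connected layer
n_out <- tail(layer_sizes, -1)   # number of outputs of each fully-connected layer

# Each fully-connected layer holds n_in * n_out weights plus n_out biases
params_per_layer <- n_in * n_out + n_out
names(params_per_layer) <- c("fc1", "fc2", "fc3")

print(params_per_layer)
total_params <- sum(params_per_layer)
print(total_params)   # 109386
```

Almost all of the parameters sit in the first layer, because it connects every input pixel to every one of its 128 neurons.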

Now that we have a blueprint in place, let's start coding. The following code block loads and prepares the MNIST data that the network will be trained on:

# setting the working directory
setwd('/home/sunil/Desktop/book/chapter 19/MNIST')
# function to load image files
load_image_file = function(filename) {
    f = file(filename, 'rb')
    # skipping the magic number at the start of the file
    readBin(f, 'integer', n = 1, size = 4, endian = 'big')
    # reading the number of images and the image dimensions
    n = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
    nrow = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
    ncol = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
    # reading the pixels as unsigned bytes
    x = readBin(f, 'integer', n = n * nrow * ncol, size = 1, signed = FALSE)
    close(f)
    # one row per image, one column per pixel
    data.frame(matrix(x, ncol = nrow * ncol, byrow = TRUE))
}
# function to load the label files
load_label_file = function(filename) {
    f = file(filename, 'rb')
    # skipping the magic number at the start of the file
    readBin(f, 'integer', n = 1, size = 4, endian = 'big')
    # reading the number of labels, then the labels as unsigned bytes
    n = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
    y = readBin(f, 'integer', n = n, size = 1, signed = FALSE)
    close(f)
    y
}
# loading the image files
train = load_image_file("train-images-idx3-ubyte")
test = load_image_file("t10k-images-idx3-ubyte")
# loading the labels
train.y = load_label_file("train-labels-idx1-ubyte")
test.y = load_label_file("t10k-labels-idx1-ubyte")
# linearly transforming the greyscale pixel values from the range 0 to 255
# to the range 0 to 1
train.x <- data.matrix(train/255)
test <- data.matrix(test/255)
# verifying the distribution of the digit labels in train dataset
print(table(train.y))
# verifying the distribution of the digit labels in test dataset
print(table(test.y))

This will give the following output: 

train.y
   0    1    2    3    4    5    6    7    8    9
5923 6742 5958 6131 5842 5421 5918 6265 5851 5949

test.y
   0    1    2    3    4    5    6    7    8    9
 980 1135 1032 1010  982  892  958 1028  974 1009
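
The two loader functions rely on the layout of the MNIST IDX files: a few 4-byte big-endian integers as a header, followed by the data as unsigned bytes. The sketch below illustrates that layout by writing a tiny made-up image file to a temporary path and reading it back with the same readBin calls. The file contents and variable names are hypothetical; the magic number 2051 is the value used for IDX image files.

```r
# Write a tiny IDX-style image file: 2 images of 2 x 3 pixels (made-up data)
path <- tempfile(fileext = ".idx3-ubyte")
con <- file(path, "wb")
# header: magic number, number of images, rows per image, columns per image
writeBin(c(2051L, 2L, 2L, 3L), con, size = 4, endian = "big")
pixels <- c(0L, 255L, 128L, 64L, 32L, 16L,   # image 1
            1L, 2L, 3L, 4L, 5L, 6L)          # image 2
writeBin(as.raw(pixels), con)                # one unsigned byte per pixel
close(con)

# Read it back the same way the loader does
con <- file(path, "rb")
header <- readBin(con, "integer", n = 4, size = 4, endian = "big")
values <- readBin(con, "integer", n = prod(header[2:4]), size = 1, signed = FALSE)
close(con)

# One row per image, one column per pixel, then scale to the range 0 to 1
images <- matrix(values, ncol = header[3] * header[4], byrow = TRUE)
scaled <- images / 255
print(dim(images))
print(range(scaled))
```

Reading the pixels with signed = FALSE matters here: with signed bytes, values above 127 would come back as negative numbers.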

Now, define the network's layers, set a seed so that the results are reproducible, and train the network using the following code block:

# including the required mxnet library 
library(mxnet)
# defining the input layer in the network architecture
data <- mx.symbol.Variable("data")
# defining the first hidden layer with 128 neurons and also naming the
# layer as fc1
# passing the input data layer as input to the fc1 layer
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
# defining the ReLU activation function on the fc1 output and also
# naming the layer as ReLU1
act1 <- mx.symbol.Activation(fc1, name="ReLU1", act_type="relu")
# defining the second hidden layer with 64 neurons and also naming the
# layer as fc2
# passing the previous activation layer output as input to the
# fc2 layer
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
# defining the ReLU activation function on the fc2 output and also
# naming the layer as ReLU2
act2 <- mx.symbol.Activation(fc2, name="ReLU2", act_type="relu")
# defining the third and final hidden layer in our network with 10
# neurons and also naming the layer as fc3
# passing the previous activation layer output as input to the
# fc3 layer
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
# defining the output layer with softmax activation function to obtain
# class probabilities
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")
# defining that the experiment should run on cpu
devices <- mx.cpu()
# setting the seed for the experiment so as to ensure that the results
# are reproducible
mx.set.seed(0)
# building the model with the network architecture defined above
model <- mx.model.FeedForward.create(softmax, X=train.x, y=train.y,
    ctx=devices, num.round=10, array.batch.size=100, array.layout="rowmajor",
    learning.rate=0.07, momentum=0.9, eval.metric=mx.metric.accuracy,
    initializer=mx.init.uniform(0.07),
    epoch.end.callback=mx.callback.log.train.metric(100))

This will give the following output: 

Start training with 1 devices
[1] Train-accuracy=0.885783334343384
[2] Train-accuracy=0.963616671562195
[3] Train-accuracy=0.97510000983874
[4] Train-accuracy=0.980016676982244
[5] Train-accuracy=0.984233343303204
[6] Train-accuracy=0.986883342464765
[7] Train-accuracy=0.98848334223032
[8] Train-accuracy=0.990800007780393
[9] Train-accuracy=0.991300007204215
[10] Train-accuracy=0.991516673564911
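
To make the arithmetic behind the layers concrete, here is a rough sketch of a single forward pass written in base R. It is not how MXNet computes things internally and it does no training. It assumes random weights drawn uniformly from -0.07 to 0.07, leaves out the bias terms for brevity, and feeds in five made-up input rows; all function and variable names are illustrative.

```r
set.seed(0)  # base R seed for this toy example only; unrelated to mx.set.seed

# ReLU keeps positive values and replaces negative values with zero
relu <- function(z) pmax(z, 0)

# Softmax applied to each row, so that every row becomes a set of probabilities
softmax_rows <- function(z) {
  z <- z - apply(z, 1, max)   # subtract each row's maximum for numerical stability
  e <- exp(z)
  e / rowSums(e)
}

# Random weight matrix with entries drawn uniformly from [-scale, scale]
init_uniform <- function(n_in, n_out, scale = 0.07) {
  matrix(runif(n_in * n_out, -scale, scale), nrow = n_in, ncol = n_out)
}

x  <- matrix(runif(5 * 784), nrow = 5)   # 5 fake "images" with pixels in [0, 1]
W1 <- init_uniform(784, 128)
W2 <- init_uniform(128, 64)
W3 <- init_uniform(64, 10)

h1    <- relu(x %*% W1)             # first dense layer followed by ReLU
h2    <- relu(h1 %*% W2)            # second dense layer followed by ReLU
probs <- softmax_rows(h2 %*% W3)    # ten-unit layer followed by softmax

print(dim(probs))        # 5 rows (inputs) by 10 columns (classes)
print(rowSums(probs))    # each row sums to 1, up to floating-point rounding
```

In this sketch each row holds one input, whereas the prediction matrix returned by MXNet later in this section has one column per observation.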

To make predictions on the test dataset and get the label for each observation in the test dataset, use the following code block:

# making predictions on the test dataset
preds <- predict(model, test)
# verifying the predicted output
print(dim(preds))
# getting the label for each observation in test dataset; the
# predicted class is the one with highest probability
pred.label <- max.col(t(preds)) - 1
# observing the distribution of predicted labels in the test dataset
print(table(pred.label))

This will give the following output:

[1]    10 10000
pred.label
   0    1    2    3    4    5    6    7    8    9
 980 1149 1030 1021 1001  869  960 1001  964 1025
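
The prediction matrix has one row per class and one column per observation, which is why it is transposed before max.col is applied, and why 1 is subtracted to move from R's 1-based positions to the digit labels 0 to 9. The following sketch shows the same step on a made-up matrix with three classes and four observations; the values are hypothetical.

```r
# Toy predictions: 3 classes (rows) x 4 observations (columns); each column sums to 1
toy_preds <- matrix(c(0.7, 0.2, 0.1,
                      0.1, 0.8, 0.1,
                      0.2, 0.3, 0.5,
                      0.6, 0.3, 0.1), nrow = 3)

# max.col works row by row, so transpose to get one row per observation,
# then subtract 1 to convert positions 1..3 into labels 0..2
toy_labels <- max.col(t(toy_preds)) - 1
print(toy_labels)   # 0 1 2 0

# An equivalent formulation that works column by column
alt_labels <- apply(toy_preds, 2, which.max) - 1
print(alt_labels)   # 0 1 2 0
```

Note that max.col breaks ties at random by default, while which.max returns the first maximum; with real-valued probabilities exact ties are unlikely, but the two can differ when they occur.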

Let's check the performance of the model using the following code:

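# obtaining the performance of the model
# (accuracy() is not part of base R or mxnet; the format of the output
# suggests the rfUtilities package, which is an assumption here)
library(rfUtilities)
print(accuracy(pred.label, test.y))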

This will give the following output: 

Accuracy (PCC): 97.73% 
Cohen's Kappa: 0.9748
Users accuracy:
   0    1    2    3    4    5    6    7    8    9
98.8 99.6 98.0 97.7 98.3 96.1 97.9 96.3 96.6 97.7
Producers accuracy:
   0    1    2    3    4    5    6    7    8    9
98.8 98.3 98.2 96.7 96.4 98.6 97.7 98.9 97.6 96.2
Confusion matrix
   y
x      0    1    2    3    4    5    6    7    8    9
  0  968    0    1    1    1    2    3    1    2    1
  1    1 1130    3    0    0    1    3    8    1    2
  2    0    1 1011    2    2    0    0   11    3    0
  3    1    2    6  987    0   14    2    2    4    3
  4    1    0    2    1  965    2   10    3    6   11
  5    1    0    0    4    0  857    2    0    3    2
  6    5    2    3    0    4    5  938    0    3    0
  7    0    0    2    2    1    1    0  990    3    2
  8    1    0    4    8    0    5    0    3  941    2
  9    2    0    0    5    9    5    0   10    8  986
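
The overall accuracy can be recomputed from the confusion matrix with base R alone, which is a useful cross-check. The sketch below re-enters the matrix printed above and treats rows as predicted digits and columns as actual digits; that reading is an assumption based on the row totals matching the predicted-label counts shown earlier.

```r
# Confusion matrix copied from the output above (rows: predicted, columns: actual)
cm <- matrix(c(
  968,    0,    1,   1,   1,   2,   3,   1,   2,   1,
    1, 1130,    3,   0,   0,   1,   3,   8,   1,   2,
    0,    1, 1011,   2,   2,   0,   0,  11,   3,   0,
    1,    2,    6, 987,   0,  14,   2,   2,   4,   3,
    1,    0,    2,   1, 965,   2,  10,   3,   6,  11,
    1,    0,    0,   4,   0, 857,   2,   0,   3,   2,
    5,    2,    3,   0,   4,   5, 938,   0,   3,   0,
    0,    0,    2,   2,   1,   1,   0, 990,   3,   2,
    1,    0,    4,   8,   0,   5,   0,   3, 941,   2,
    2,    0,    0,   5,   9,   5,   0,  10,   8, 986),
  nrow = 10, byrow = TRUE,
  dimnames = list(predicted = as.character(0:9), actual = as.character(0:9)))

# Overall accuracy: correctly classified images (the diagonal) over all images
overall <- sum(diag(cm)) / sum(cm)
print(overall)   # 0.9773

# Share of each actual digit that was classified correctly
per_digit <- diag(cm) / colSums(cm)
print(round(100 * per_digit, 1))
```

With the actual vectors at hand, table(pred.label, test.y) builds the same kind of matrix directly.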

To visualize the network architecture, use the following code:

# Visualizing the network architecture
graph.viz(model$symbol)

This will render a diagram of the network architecture (the figure is not reproduced here).

With this simple architecture, trained for a few minutes on a CPU-based laptop and with minimal effort, we were able to achieve an accuracy of 97.7% on the test dataset. The deep learning network learned to interpret the digits from the images it was given as input. The accuracy of the system may be further improved by altering the architecture or by increasing the number of iterations. Note that, in the earlier experiment, we ran the training for 10 iterations, that is, 10 rounds over the training data.
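
To put the number of rounds into perspective, the following back-of-the-envelope sketch counts the mini-batches involved. It assumes that one round is one full pass over the 60,000 training images (the total of the train.y table) and that the weights are updated once per mini-batch of 100 images; the variable names are illustrative.

```r
n_train    <- 60000   # number of training images (sum of the train.y table)
batch_size <- 100     # value passed as array.batch.size

# Mini-batches needed for one full pass over the training data
batches_per_round <- n_train / batch_size
print(batches_per_round)   # 600

# Total mini-batches processed for 10 rounds and for 50 rounds
updates <- batches_per_round * c("10 rounds" = 10, "50 rounds" = 50)
print(updates)   # 6000 and 30000
```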

The number of iterations can be changed at model-building time through the num.round parameter. There is no hard-and-fast rule for the optimal number of rounds, so this is something to be determined by trial and error. Let's build the model with 50 iterations and observe the impact on performance. The code remains the same as before, except for the following amendment to the model-building code:

model <- mx.model.FeedForward.create(softmax, X=train.x, y=train.y,
    ctx=devices, num.round=50, array.batch.size=100, array.layout="rowmajor",
    learning.rate=0.07, momentum=0.9, eval.metric=mx.metric.accuracy,
    initializer=mx.init.uniform(0.07),
    epoch.end.callback=mx.callback.log.train.metric(100))

Observe that the num.round parameter is now set to 50, instead of the earlier value of 10.

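Rerunning the training, prediction, and evaluation code will give the following output (only the final rounds of the training log are shown):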

[35] Train-accuracy=0.999933333396912
[36] Train-accuracy=1
[37] Train-accuracy=1
[38] Train-accuracy=1
[39] Train-accuracy=1
[40] Train-accuracy=1
[41] Train-accuracy=1
[42] Train-accuracy=1
[43] Train-accuracy=1
[44] Train-accuracy=1
[45] Train-accuracy=1
[46] Train-accuracy=1
[47] Train-accuracy=1
[48] Train-accuracy=1
[49] Train-accuracy=1
[50] Train-accuracy=1
[1]    10 10000
pred.label
   0    1    2    3    4    5    6    7    8    9
 992 1139 1029 1017  983  877  953 1021  972 1017
Accuracy (PCC): 98.21%
Cohen's Kappa: 0.9801
Users accuracy:
   0    1    2    3    4    5    6    7    8    9
99.3 99.5 98.2 98.2 98.1 97.1 98.0 97.7 98.0 97.8
Producers accuracy:
   0    1    2    3    4    5    6    7    8    9
98.1 99.1 98.4 97.5 98.0 98.7 98.5 98.3 98.3 97.1
Confusion matrix
   y
x      0    1    2    3    4    5    6    7    8    9
  0  973    0    2    2    1    3    5    1    3    2
  1    1 1129    0    0    1    1    3    2    0    2
  2    1    0 1013    1    3    0    0    9    2    0
  3    0    1    5  992    0   10    1    1    3    4
  4    0    0    2    0  963    2    7    1    1    7
  5    0    0    0    4    1  866    2    0    2    2
  6    2    2    1    0    3    5  939    0    1    0
  7    0    1    6    3    1    1    0 1004    2    3
  8    1    1    3    4    0    2    1    3  955    2
  9    2    1    0    4    9    2    0    7    5  987

We can observe from the output that 100% accuracy was obtained on the training dataset, whereas the accuracy on the test dataset is about 98.2%. For a model to be called a good model, it is expected to perform about equally well on both the training and the test datasets. In this case, the gap between the two indicates a situation known as overfitting, which means that the model we created does not generalize as well as its training accuracy suggests. In other words, the model has too many parameters or was trained for too long, and it has become super-specialized to the data in the training dataset; as an effect, it does not do as good a job with new data. Model generalization is something we should specifically aim for. There is a technique, known as dropout, that can help us to overcome the overfitting issue.
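
One simple way to keep an eye on overfitting is to track the gap between training and test accuracy rather than either number alone. The sketch below tabulates the figures reported in the two outputs above (training accuracy rounded to four decimals); the data frame and its column names are illustrative only.

```r
# Accuracies taken from the two experiments above
results <- data.frame(
  rounds    = c(10, 50),
  train_acc = c(0.9915, 1.0000),   # final Train-accuracy values, rounded
  test_acc  = c(0.9773, 0.9821)    # Accuracy (PCC) on the test dataset
)

# Generalization gap: how much better the model does on data it has seen
results$gap <- results$train_acc - results$test_acc
print(results)
```

Going from 10 to 50 rounds raised the test accuracy slightly, but the gap between training and test accuracy widened from roughly 1.4 to 1.8 percentage points, which is the pattern the overfitting discussion refers to.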
