So far, we've learned about the basic building blocks of convolutional neural networks. The concepts illustrated in this chapter are not really more difficult than traditional multilayer neural networks. Intuitively, we can say that the most important operation in a traditional neural network is the matrix-vector multiplication.
For instance, we use matrix-vector multiplications to compute pre-activations (or net inputs), as in $\mathbf{a} = \mathbf{W}\mathbf{x} + \mathbf{b}$. Here, $\mathbf{x}$ is a column vector representing pixels, and $\mathbf{W}$ is the weight matrix connecting the pixel inputs to each hidden unit. In a convolutional neural network, this operation is replaced by a convolution operation, as in $\mathbf{A} = \mathbf{W} * \mathbf{X} + b$, where $\mathbf{X}$ is a matrix representing the pixels in a height x width arrangement. In both cases, the pre-activations are passed to an activation function to obtain the activation of a hidden unit, $\mathbf{H} = \phi(\mathbf{A})$, where $\phi$ is the activation function. Furthermore, recall that subsampling is another building block of a convolutional neural network, which may appear in the form of pooling, as we described in the previous section.
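To make the contrast concrete, the following minimal NumPy sketch computes pre-activations both ways. The array sizes, the random values, and the choice of ReLU as the activation function are assumptions made purely for illustration; the convolution is written as an explicit loop over a flipped kernel without padding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fully connected layer: flatten a 4x4 "image" into a vector x.
x = rng.random(16)            # 16 pixel inputs
W_fc = rng.random((3, 16))    # 3 hidden units, one weight per pixel
b_fc = np.zeros(3)
a_fc = W_fc @ x + b_fc        # pre-activations, shape (3,)

# Convolutional layer: keep the 4x4 arrangement and slide a 3x3 kernel.
X = x.reshape(4, 4)
W_conv = rng.random((3, 3))
b_conv = 0.0
W_rot = W_conv[::-1, ::-1]    # convolution flips the kernel along both axes
A = np.empty((2, 2))          # no padding: (4 - 3 + 1) x (4 - 3 + 1) outputs
for i in range(2):
    for j in range(2):
        A[i, j] = np.sum(X[i:i + 3, j:j + 3] * W_rot) + b_conv


def relu(z):
    """Example activation function (an assumption, any phi would do)."""
    return np.maximum(z, 0.0)


h_fc, H = relu(a_fc), relu(A)
print(a_fc.shape, A.shape)
```

Note that the fully connected layer needs one weight per pixel and hidden unit, whereas the convolution reuses the same nine kernel weights at every position.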
An input sample to a convolutional layer may contain one or more 2D arrays or matrices with dimensions $N_1 \times N_2$ (for example, the image height and width in pixels). These $N_1 \times N_2$ matrices are called channels. Therefore, using multiple channels as input to a convolutional layer requires us to use a rank-3 tensor or a three-dimensional array: $\mathbf{X}_{N_1 \times N_2 \times C_{in}}$, where $C_{in}$ is the number of input channels.
For example, let's consider images as input to the first layer of a CNN. If the image is colored and uses the RGB color mode, then $C_{in} = 3$ (for the red, green, and blue color channels in RGB). However, if the image is in grayscale, then we have $C_{in} = 1$ because there is only one channel with the grayscale pixel intensity values.
When we work with images, we can read images into NumPy arrays using the 'uint8' (unsigned 8-bit integer) data type to reduce memory usage compared to 16-bit, 32-bit, or 64-bit integer types, for example. Unsigned 8-bit integers take values in the range [0, 255], which is sufficient to store the pixel information in RGB images, whose channel values lie in the same range.
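As a rough illustration of the memory savings, the sketch below compares the size of the same array stored as 'uint8' and as 64-bit integers. The array is a hypothetical all-zero placeholder, not a real image, and its shape is chosen arbitrarily.

```python
import numpy as np

# A hypothetical 252 x 221 image with 3 color channels, filled with zeros.
img_uint8 = np.zeros((252, 221, 3), dtype=np.uint8)
img_int64 = img_uint8.astype(np.int64)

print('uint8 bytes:', img_uint8.nbytes)   # 1 byte per value
print('int64 bytes:', img_int64.nbytes)   # 8 bytes per value

# The representable range of uint8 covers all RGB intensity values.
info = np.iinfo(np.uint8)
print('uint8 range:', info.min, info.max)
```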
Next, let's look at an example of how we can read an image into our Python session using SciPy. However, please note that reading images with SciPy requires that you have the Python Imaging Library (PIL) package installed. We can install Pillow (https://python-pillow.org), a more user-friendly fork of PIL, to satisfy this requirement, as follows:
pip install pillow
Once Pillow is installed, we can use the imread function from the scipy.misc module to read an RGB image (this example image is located in the code bundle folder that is provided with this chapter at https://github.com/rasbt/python-machine-learning-book-2nd-edition/tree/master/code/ch15):
>>> import scipy.misc
>>> img = scipy.misc.imread('./example-image.png',
...                         mode='RGB')
>>> print('Image shape:', img.shape)
Image shape: (252, 221, 3)
>>> print('Number of channels:', img.shape[2])
Number of channels: 3
>>> print('Image data type:', img.dtype)
Image data type: uint8
>>> print(img[100:102, 100:102, :])
[[[179 134 110]
  [182 136 112]]

 [[180 135 111]
  [182 137 113]]]
Please note that the imread function, as well as other image processing utilities from scipy.misc, has been moved into a separate library, imageio. Hence, in newer versions of SciPy, scipy.misc.imread may no longer be available. In this case, you can use the equivalent imageio.imread function after installing imageio via pip and importing it.

Now that we have familiarized ourselves with the structure of input data, the next question is: how can we incorporate multiple input channels in the convolution operation that we discussed in the previous sections?
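The answer is very simple: we perform the convolution operation for each channel separately and then add the results together using matrix summation. The convolution associated with each channel $c$ has its own kernel matrix, $\mathbf{W}[:, :, c]$. Given a sample $\mathbf{X}_{N_1 \times N_2 \times C_{in}}$, a kernel tensor $\mathbf{W}_{m_1 \times m_2 \times C_{in}}$, and a bias value $b$, the total pre-activation result is computed in the following formula:

$$\mathbf{Y}^{conv} = \sum_{c=1}^{C_{in}} \mathbf{W}[:, :, c] * \mathbf{X}[:, :, c], \qquad \mathbf{A} = \mathbf{Y}^{conv} + b, \qquad \mathbf{H} = \phi(\mathbf{A})$$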
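The final result, $\mathbf{H}$, is called a feature map. Usually, a convolutional layer of a CNN has more than one feature map. If we use multiple feature maps, the kernel tensor becomes four-dimensional: $width \times height \times C_{in} \times C_{out}$. Here, width x height is the kernel size, $C_{in}$ is the number of input channels, and $C_{out}$ is the number of output feature maps. So, now let's include the number of output feature maps in the preceding formula and update it as follows, for $k = 1, \dots, C_{out}$ and with one bias value $\mathbf{b}[k]$ per output feature map:

$$\mathbf{Y}^{conv}[:, :, k] = \sum_{c=1}^{C_{in}} \mathbf{W}[:, :, c, k] * \mathbf{X}[:, :, c], \qquad \mathbf{A}[:, :, k] = \mathbf{Y}^{conv}[:, :, k] + \mathbf{b}[k], \qquad \mathbf{H}[:, :, k] = \phi(\mathbf{A}[:, :, k])$$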
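The per-channel summation and the extension to several output feature maps can be sketched in plain NumPy as follows. This is a deliberately naive illustration, not an efficient implementation: the function names conv2d_same and conv_layer are hypothetical, and the sketch assumes odd kernel dimensions, a stride of 1, zero padding so that the output keeps the input size, and ReLU as the activation function.

```python
import numpy as np


def conv2d_same(X, W):
    """Naive 2D convolution of one channel with zero padding ('same' size).

    Assumes odd kernel dimensions and a stride of 1.
    """
    m1, m2 = W.shape
    p1, p2 = m1 // 2, m2 // 2
    X_pad = np.pad(X, ((p1, p1), (p2, p2)), mode='constant')
    W_rot = W[::-1, ::-1]                  # flip the kernel along both axes
    out = np.empty(X.shape, dtype=float)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            out[i, j] = np.sum(X_pad[i:i + m1, j:j + m2] * W_rot)
    return out


def conv_layer(X, W, b):
    """X: (N1, N2, C_in), W: (m1, m2, C_in, C_out), b: (C_out,)."""
    n1, n2, c_in = X.shape
    c_out = W.shape[3]
    H = np.empty((n1, n2, c_out))
    for k in range(c_out):
        # Sum the per-channel convolutions, then add the bias of map k.
        y = sum(conv2d_same(X[:, :, c], W[:, :, c, k]) for c in range(c_in))
        H[:, :, k] = np.maximum(y + b[k], 0.0)   # ReLU as an example phi
    return H


rng = np.random.default_rng(1)
X = rng.random((6, 5, 3))            # three input channels
W = rng.normal(size=(3, 3, 3, 5))    # 3x3 kernels, five output feature maps
b = np.zeros(5)
H = conv_layer(X, W, b)
print(H.shape)
```

With a single output feature map (a kernel tensor whose last dimension is 1), the same code reduces to the single-feature-map case.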
To conclude our discussion of computing convolutions in the context of neural networks, let's look at the example in the following figure that shows a convolutional layer, followed by a pooling layer.
In this example, there are three input channels. The kernel tensor is four-dimensional. Each kernel matrix is denoted as $m_1 \times m_2$, and there are three of them, one for each input channel. Furthermore, there are five such kernels, accounting for five output feature maps. Finally, there is a pooling layer for subsampling the feature maps, as shown in the following figure:
How many trainable parameters exist in the preceding example?
To illustrate the advantages of convolution, namely parameter sharing and sparse connectivity, let's work through an example. The kernel of the convolutional layer in the network shown in the preceding figure is a four-dimensional tensor. So, there are $m_1 \times m_2 \times 3 \times 5$ parameters associated with the kernel. Furthermore, there is a bias vector with one entry for each output feature map of the convolutional layer. Thus, the size of the bias vector is 5. Pooling layers do not have any (trainable) parameters; therefore, the total number of trainable parameters is as follows:

$$m_1 \times m_2 \times 3 \times 5 + 5$$

If the input tensor is of size $N_1 \times N_2 \times 3$, assuming that the convolution is performed with mode='same', then the output feature maps would be of size $N_1 \times N_2 \times 5$.

Note that this number is much smaller than what a fully connected layer would require in place of the convolutional layer. In the case of a fully connected layer, the number of parameters for the weight matrix to reach the same number of output units would have been as follows:

$$(N_1 \times N_2 \times 3) \times (N_1 \times N_2 \times 5) = (N_1 \times N_2)^2 \times 3 \times 5$$
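To get a feeling for the size of this difference, the following sketch plugs concrete numbers into both counts. The 5x5 kernel size and the 28x28 input size are assumed values chosen only for illustration, and the helper function names are hypothetical.

```python
def conv_params(m1, m2, c_in, c_out):
    """Kernel weights plus one bias per output feature map."""
    return m1 * m2 * c_in * c_out + c_out


def fc_weights(n1, n2, c_in, c_out):
    """Weight matrix connecting every input unit to every output unit."""
    return (n1 * n2 * c_in) * (n1 * n2 * c_out)


# Assumed sizes: 5x5 kernels, 28x28 inputs, 3 input channels, 5 feature maps.
n_conv = conv_params(5, 5, 3, 5)
n_fc = fc_weights(28, 28, 3, 5)
print('Convolutional layer:', n_conv)      # 380
print('Fully connected weights:', n_fc)    # 9219840
```

Under these assumptions, the fully connected weight matrix is more than 24,000 times larger than the full set of convolutional parameters.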
In the next section, we will talk about how to regularize a neural network.
Choosing the size of a network, whether we are dealing with a traditional (fully connected) neural network or a CNN, has always been a challenging problem. For instance, the size of a weight matrix and the number of layers need to be tuned to achieve a reasonably good performance.
The capacity of a network refers to the level of complexity of the function that it can learn. Small networks, that is, networks with a relatively small number of parameters, have a low capacity and are therefore likely to underfit, resulting in poor performance since they cannot learn the underlying structure of complex datasets.
Yet, very large networks may more easily result in overfitting, where the network will memorize the training data and do extremely well on the training set while achieving poor performance on the held-out test set. When we deal with real-world machine learning problems, we do not know how large the network should be a priori.
One way to address this problem is to build a network with a relatively large capacity (in practice, we want to choose a capacity that is slightly larger than necessary) to do well on the training set. Then, to prevent overfitting, we can apply one or multiple regularization schemes to achieve good generalization performance on new data, such as the held-out test set. A popular choice for regularization is L2 regularization, which we discussed previously in this book.
In recent years, another popular regularization technique called dropout has emerged that works amazingly well for regularizing (deep) neural networks (Dropout: a simple way to prevent neural networks from overfitting, Nitish Srivastava and others, Journal of Machine Learning Research 15.1, pages 1929-1958, 2014, http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf).
Intuitively, dropout can be considered as the consensus (averaging) of an ensemble of models. In ensemble learning, we train several models independently. During prediction, we then use the consensus of all the trained models. However, both training several models and collecting and averaging the output of multiple models is computationally expensive. Here, dropout offers a workaround with an efficient way to train many models at once and compute their average predictions at test or prediction time.
Dropout is usually applied to the hidden units of higher layers. During the training phase of a neural network, a fraction of the hidden units is randomly dropped at every iteration with probability $p_{drop}$ (or the keep probability $p_{keep} = 1 - p_{drop}$).
This dropout probability is determined by the user, and the common choice is $p = 0.5$, as discussed in the previously mentioned article by Nitish Srivastava and others, 2014. When dropping a certain fraction of input neurons, the weights associated with the remaining neurons are rescaled to account for the missing (dropped) neurons.
This random dropout forces the network to learn a redundant representation of the data. The network cannot rely on the activation of any particular set of hidden units, since they may be turned off at any time during training, and it is therefore forced to learn more general and robust patterns from the data.
This random dropout can effectively prevent overfitting. The following figure shows an example of applying dropout with probability $p = 0.5$ during the training phase, whereby half of the neurons become inactive randomly. However, during prediction, all neurons will contribute to computing the pre-activations of the next layer.
As shown here, one important point to remember is that units may drop randomly during training only, while for the evaluation phase, all the hidden units must be active (for instance, $p_{drop} = 0$ or $p_{keep} = 1$). To ensure that the overall activations are on the same scale during training and prediction, the activations of the active neurons have to be scaled appropriately (for example, by halving the activation if the dropout probability was set to $p = 0.5$).
However, since it is inconvenient to always scale activations when we make predictions in practice, TensorFlow and other tools scale the activations during training (for example, by doubling the activations if the dropout probability was set to $p = 0.5$).
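The scaling-during-training variant described above can be sketched in a few lines of NumPy. This is only an illustration of the idea and not TensorFlow's implementation; the dropout function and its arguments are hypothetical names introduced here.

```python
import numpy as np


def dropout(h, p_drop, training, rng):
    """Drop units with probability p_drop and rescale the survivors.

    During evaluation (training=False), the activations pass through unchanged.
    """
    if not training or p_drop == 0.0:
        return h
    p_keep = 1.0 - p_drop
    mask = rng.random(h.shape) < p_keep    # True for units that stay active
    return h * mask / p_keep               # e.g., doubling when p_drop = 0.5


rng = np.random.default_rng(0)
h = np.ones((1000, 100))                   # dummy hidden activations
h_train = dropout(h, p_drop=0.5, training=True, rng=rng)
h_eval = dropout(h, p_drop=0.5, training=False, rng=rng)

print('Values seen during training:', np.unique(h_train))
print('Mean activation (training):', h_train.mean())
print('Mean activation (evaluation):', h_eval.mean())
```

Because the surviving activations are doubled, the mean activation during training stays close to the mean at evaluation time, so no rescaling is needed when making predictions.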
So, what is the relationship between dropout and ensemble learning? Since we drop different hidden neurons at each iteration, effectively we are training different models. When all these models are finally trained, we set the keep probability to 1 and use all the hidden units. This means we are taking the average activation from all the hidden units.
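The ensemble view can be checked numerically for the simplest possible case, a single linear unit. The sketch below uses made-up weights and activations and assumes that activations are rescaled by the keep probability during training; it averages the outputs of many randomly thinned "sub-models" and compares the result with the output of the full model. For networks with nonlinear activation functions, this correspondence is only approximate.

```python
import numpy as np

rng = np.random.default_rng(42)
h = np.array([0.5, 1.0, 1.5, 2.0])     # made-up hidden activations
w = np.array([0.2, -0.4, 0.6, 0.8])    # made-up weights of one linear unit
p_keep = 0.5

# Each row is one random sub-model: a different subset of active units.
masks = rng.random((20000, h.size)) < p_keep
z_sub = (masks * h / p_keep) @ w        # outputs of the thinned sub-models
z_full = h @ w                          # output with all units active

print('Average over sub-models:', z_sub.mean())
print('Full model output:', z_full)
```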