Putting everything together to build a CNN

So far, we've learned about the basic building blocks of convolutional neural networks. The concepts illustrated in this chapter are not really more difficult to grasp than those behind traditional multilayer neural networks. Intuitively, we can say that the most important operation in a traditional neural network is the matrix-vector multiplication.

For instance, we use matrix-vector multiplications to compute the pre-activations (or net inputs), as in $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$. Here, $\mathbf{x}$ is a column vector representing the pixels, and $\mathbf{W}$ is the weight matrix connecting the pixel inputs to each hidden unit. In a convolutional neural network, this operation is replaced by a convolution operation, as in $\mathbf{Z} = \mathbf{W} * \mathbf{X} + \mathbf{b}$, where $\mathbf{X}$ is a matrix representing the pixels in a height × width arrangement. In both cases, the pre-activations are passed to an activation function to obtain the activation of a hidden unit, $\mathbf{A} = \phi(\mathbf{Z})$, where $\phi$ is the activation function. Furthermore, recall that subsampling is another building block of a convolutional neural network, which may appear in the form of pooling, as we described in the previous section.
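The following minimal NumPy/SciPy sketch contrasts the two kinds of pre-activation; the array sizes and variable names are illustrative choices of ours, not something prescribed by the text:

import numpy as np
from scipy.signal import convolve2d

rng = np.random.RandomState(1)

# Fully connected pre-activation: z = Wx + b
x = rng.rand(16)             # 16 pixels flattened into a column vector
W = rng.rand(4, 16)          # weights connecting the pixels to 4 hidden units
b = np.zeros(4)
z = np.dot(W, x) + b         # matrix-vector multiplication

# Convolutional pre-activation: Z = W * X + b (one small kernel shared across positions)
X = rng.rand(4, 4)           # the same 16 pixels in a height x width arrangement
W_kernel = rng.rand(3, 3)    # 3x3 kernel
Z = convolve2d(X, W_kernel, mode='same') + 0.5   # scalar bias

# In both cases, an activation function is applied to the pre-activations
a, A = np.tanh(z), np.tanh(Z)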

Working with multiple input or color channels

An input sample to a convolutional layer may contain one or more 2D arrays or matrices with dimensions $N_1 \times N_2$ (for example, the image height and width in pixels). These $N_1 \times N_2$ matrices are called channels. Therefore, using multiple channels as input to a convolutional layer requires us to use a rank-3 tensor or a three-dimensional array, $X_{N_1 \times N_2 \times C_{in}}$, where $C_{in}$ is the number of input channels.

For example, let's consider images as input to the first layer of a CNN. If the image is colored and uses the RGB color mode, then $C_{in} = 3$ (for the red, green, and blue color channels in RGB). However, if the image is in grayscale, then we have $C_{in} = 1$ because there is only one channel with the grayscale pixel intensity values.
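To make the shapes concrete, the following short NumPy sketch (with array sizes chosen purely for illustration) builds such rank-3 inputs for both cases:

import numpy as np

rgb_image = np.zeros((252, 221, 3), dtype='uint8')   # C_in = 3 for an RGB image
gray_image = np.zeros((252, 221), dtype='uint8')     # a single grayscale channel
gray_tensor = gray_image[:, :, np.newaxis]           # rank-3 array with C_in = 1
print(rgb_image.shape, gray_tensor.shape)            # (252, 221, 3) (252, 221, 1)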

Tip

When we work with images, we can read them into NumPy arrays using the 'uint8' (unsigned 8-bit integer) data type to reduce memory usage compared to 16-bit, 32-bit, or 64-bit integer types, for example. Unsigned 8-bit integers take values in the range [0, 255], which is sufficient to store the pixel information in RGB images, whose values also lie in this range.
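As a quick sanity check (our own illustrative snippet), we can compare the memory footprint of the same array stored as uint8 and as int64:

>>> import numpy as np
>>> img_uint8 = np.zeros((252, 221, 3), dtype='uint8')
>>> print(img_uint8.nbytes, img_uint8.astype('int64').nbytes)
167076 1336608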

Next, let's look at an example of how we can read an image into our Python session using SciPy. However, please note that reading images with SciPy requires that you have the Python Imaging Library (PIL) package installed. We can install Pillow (https://python-pillow.org), a more user-friendly fork of PIL, to satisfy this requirement, as follows:

pip install pillow

Once Pillow is installed, we can use the imread function from the scipy.misc module to read an RGB image (this example image is located in the code bundle folder that is provided with this chapter at https://github.com/rasbt/python-machine-learning-book-2nd-edition/tree/master/code/ch15):

>>> import scipy.misc
>>> img = scipy.misc.imread('./example-image.png',
...                         mode='RGB')
>>> print('Image shape:', img.shape)
Image shape: (252, 221, 3)
>>> print('Number of channels:', img.shape[2])
Number of channels: 3
>>> print('Image data type:', img.dtype)
Image data type: uint8
>>> print(img[100:102, 100:102, :])
 [[[179 134 110]
  [182 136 112]]

 [[180 135 111]
  [182 137 113]]]

Now that we have familiarized ourselves with the structure of input data, the next question is: how can we incorporate multiple input channels in the convolution operation that we discussed in the previous sections?

Please note that the imread function, as well as other image processing utilities from scipy.misc, has been outsourced into a separate library, imageio. Hence, in future versions of SciPy, scipy.misc.imread might not work anymore. In this case, you can use the equivalent imageio.imread function after installing imageio (via pip install imageio) and importing it (import imageio).
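For instance, a minimal equivalent of the earlier example using imageio might look as follows (assuming the same example-image.png file; here we simply omit the mode argument):

>>> import imageio
>>> img = imageio.imread('./example-image.png')
>>> print('Image shape:', img.shape)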

The answer is very simple: we perform the convolution operation for each channel separately and then add the results together via matrix summation. The convolution associated with each channel ($c$) has its own kernel matrix, $\mathbf{W}[:, :, c]$. The total pre-activation result is computed by the following formula:

Given an input $\mathbf{X}_{n_1 \times n_2 \times C_{in}}$, a kernel tensor $\mathbf{W}_{m_1 \times m_2 \times C_{in}}$, and a bias value $b$:

$\mathbf{Z}^{\text{conv}} = \sum_{c=1}^{C_{in}} \mathbf{W}[:, :, c] * \mathbf{X}[:, :, c]$

Pre-activation: $\mathbf{Z} = \mathbf{Z}^{\text{conv}} + b$

Feature map: $\mathbf{h} = \phi(\mathbf{Z})$

The final result, $\mathbf{h}$, is called a feature map. Usually, a convolutional layer of a CNN has more than one feature map. If we use multiple feature maps, the kernel tensor becomes four-dimensional: $m_1 \times m_2 \times C_{in} \times C_{out}$. Here, $m_1 \times m_2$ is the kernel size (its width and height), $C_{in}$ is the number of input channels, and $C_{out}$ is the number of output feature maps. So, now let's include the number of output feature maps in the preceding formula and update it as follows:

$\mathbf{Z}^{\text{conv}}[:, :, k] = \sum_{c=1}^{C_{in}} \mathbf{W}[:, :, c, k] * \mathbf{X}[:, :, c]$

$\mathbf{Z}[:, :, k] = \mathbf{Z}^{\text{conv}}[:, :, k] + b[k]$

$\mathbf{h}[:, :, k] = \phi\big(\mathbf{Z}[:, :, k]\big), \quad k = 1, \dots, C_{out}$
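The following is a naive, unoptimized NumPy/SciPy sketch of these formulas, meant only to illustrate the bookkeeping (the function name conv_layer and the ReLU default are our own choices, and scipy.signal.convolve2d performs a flipped-kernel convolution as written in the formula above); real deep learning libraries implement this far more efficiently:

import numpy as np
from scipy.signal import convolve2d

def conv_layer(X, W, b, phi=lambda z: np.maximum(z, 0.)):
    # X: (n1, n2, C_in) input, W: (m1, m2, C_in, C_out) kernels, b: (C_out,) biases
    n1, n2, C_in = X.shape
    C_out = W.shape[3]
    h = np.zeros((n1, n2, C_out))
    for k in range(C_out):                # one output feature map at a time
        Z_conv = np.zeros((n1, n2))
        for c in range(C_in):             # sum the per-channel convolutions
            Z_conv += convolve2d(X[:, :, c], W[:, :, c, k], mode='same')
        h[:, :, k] = phi(Z_conv + b[k])   # add bias, apply activation
    return h

rng = np.random.RandomState(123)
X = rng.rand(8, 8, 3)                         # 3 input channels
W = rng.rand(3, 3, 3, 5) * 0.1                # 3x3 kernels, 5 output feature maps
print(conv_layer(X, W, np.zeros(5)).shape)    # (8, 8, 5)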

To conclude our discussion of computing convolutions in the context of neural networks, let's look at the example in the following figure that shows a convolutional layer, followed by a pooling layer.

In this example, there are three input channels. The kernel tensor is four-dimensional. Each kernel matrix is denoted as $\mathbf{W}[:, :, c, k]$, and there are three of them, one for each input channel. Furthermore, there are five such kernels, accounting for five output feature maps. Finally, there is a pooling layer for subsampling the feature maps, as shown in the following figure:

[Figure: a convolutional layer with three input channels and five output feature maps, followed by a pooling layer that subsamples the feature maps]

How many trainable parameters exist in the preceding example?

Tip

To illustrate the advantages of convolution, namely parameter sharing and sparse connectivity, let's work through an example. The kernel of the convolutional layer in the network shown in the preceding figure is a four-dimensional tensor. So, there are $m_1 \times m_2 \times 3 \times 5$ parameters associated with the kernel. Furthermore, there is a bias vector for each output feature map of the convolutional layer. Thus, the size of the bias vector is 5. Pooling layers do not have any (trainable) parameters; therefore, we can write the following:

$m_1 \times m_2 \times 3 \times 5 + 5$

If the input tensor is of size $n_1 \times n_2 \times 3$, assuming that the convolution is performed with mode='same', then the output feature maps would be of size $n_1 \times n_2 \times 5$.

Note that this number is much smaller than the case if we wanted to have a fully connected layer instead of the convolution layer. In the case of a fully connected layer, the number of parameters for the weight matrix to reach the same number of output units would have been as follows:

$(n_1 \times n_2 \times 3) \times (n_1 \times n_2 \times 5) = (n_1 \times n_2)^2 \times 3 \times 5$


Given that $m_1 < n_1$ and $m_2 < n_2$, we can see that the difference in the number of trainable parameters is huge.
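Plugging in some concrete (purely illustrative) numbers makes the gap obvious; for example, for a 10×10 input and 5×5 kernels:

n1, n2 = 10, 10     # input height and width (illustrative values)
m1, m2 = 5, 5       # kernel height and width (illustrative values)
conv_params = m1 * m2 * 3 * 5 + 5            # 380 parameters
fc_params = (n1 * n2 * 3) * (n1 * n2 * 5)    # 150,000 parameters
print(conv_params, fc_params)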

In the next section, we will talk about how to regularize a neural network.

Regularizing a neural network with dropout

Choosing the size of a network, whether we are dealing with a traditional (fully connected) neural network or a CNN, has always been a challenging problem. For instance, the size of a weight matrix and the number of layers need to be tuned to achieve a reasonably good performance.

The capacity of a network refers to the level of complexity of the function that it can learn. Small networks, that is, networks with a relatively small number of parameters, have a low capacity and are therefore likely to underfit, resulting in poor performance, since they cannot learn the underlying structure of complex datasets.

Yet, very large networks may more easily result in overfitting, where the network will memorize the training data and do extremely well on the training set while achieving poor performance on the held-out test set. When we deal with real-world machine learning problems, we do not know how large the network should be a priori.

One way to address this problem is to build a network with a relatively large capacity (in practice, we want to choose a capacity that is slightly larger than necessary) to do well on the training set. Then, to prevent overfitting, we can apply one or multiple regularization schemes to achieve good generalization performance on new data, such as the held-out test set. A popular choice for regularization is L2 regularization, which we discussed previously in this book.

In recent years, another popular regularization technique called dropout has emerged that works amazingly well for regularizing (deep) neural networks (Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Nitish Srivastava and others, Journal of Machine Learning Research 15.1, pages 1929-1958, 2014, http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf).

Intuitively, dropout can be considered as the consensus (averaging) of an ensemble of models. In ensemble learning, we train several models independently. During prediction, we then use the consensus of all the trained models. However, both training several models and collecting and averaging the outputs of multiple models are computationally expensive. Here, dropout offers a workaround: an efficient way to train many models at once and compute their average predictions at test or prediction time.

Dropout is usually applied to the hidden units of the higher layers. During the training phase of a neural network, a fraction of the hidden units is randomly dropped at every iteration with probability $p_{drop}$ (or, equivalently, kept with the keep probability $p_{keep} = 1 - p_{drop}$).

This dropout probability is determined by the user, and a common choice is $p_{drop} = 0.5$, as discussed in the previously mentioned article by Nitish Srivastava and others, 2014. When dropping a certain fraction of input neurons, the weights associated with the remaining neurons are rescaled to account for the missing (dropped) neurons.

The effect of this random dropout is that the network is forced to learn a redundant representation of the data. Therefore, the network cannot rely on the activation of any particular set of hidden units, since they may be turned off at any time during training, and it is forced to learn more general and robust patterns from the data.

This random dropout can effectively prevent overfitting. The following figure shows an example of applying dropout with probability $p_{drop} = 0.5$ during the training phase, whereby half of the neurons randomly become inactive. However, during prediction, all neurons contribute to computing the pre-activations of the next layer.

[Figure: the same network during training with dropout ($p_{drop} = 0.5$), where roughly half of the hidden units are randomly deactivated, and at prediction time, where all units are active]

As shown here, one important point to remember is that units are dropped randomly during training only, whereas during the evaluation (inference) phase, all the hidden units must be active (for instance, $p_{drop} = 0$ or $p_{keep} = 1$). To ensure that the overall activations are on the same scale during training and prediction, the activations of the active neurons have to be scaled appropriately (for example, by halving the activation if the dropout probability was set to $p_{drop} = 0.5$).

However, since it is inconvenient to always scale activations when we make predictions in practice, TensorFlow and other tools scale the activations during training instead (for example, by doubling the activations if the dropout probability was set to $p_{drop} = 0.5$).
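The following NumPy sketch illustrates this "scale during training" (inverted dropout) idea; the function name and defaults are our own illustrative choices, not TensorFlow's API:

import numpy as np

def inverted_dropout(activations, p_drop=0.5, training=True, rng=np.random):
    if not training:
        return activations                           # prediction: use all units as-is
    p_keep = 1.0 - p_drop
    mask = rng.rand(*activations.shape) < p_keep     # randomly keep each unit
    return activations * mask / p_keep               # rescale the surviving activations

a = np.ones((2, 4))
print(inverted_dropout(a, p_drop=0.5, training=True))   # zeros and 2.0s
print(inverted_dropout(a, training=False))              # unchanged at prediction time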

So, what is the relationship between dropout and ensemble learning? Since we drop different hidden neurons at each iteration, effectively we are training different models. When all these models are finally trained, we set the keep probability to 1 and use all the hidden units. This means we are taking the average activation from all the hidden units.
