Tuning and optimizing CNN hyperparameters

The following hyperparameters are particularly important and need to be tuned to achieve good results (a configuration sketch follows the list).

  • Dropout: Randomly omits feature detectors during training to prevent overfitting
  • Sparsity: Encourages sparse activations, which helps when the inputs themselves are sparse or rare
  • Adagrad: Adapts the learning rate per parameter, optimizing it for each feature
  • Regularization: L1 and L2 regularization of the weights
  • Weight transforms: Useful for deep autoencoders
  • Probability distribution manipulation: Controls the distribution from which the initial weights are drawn
  • Gradient normalization and clipping: Keep gradient magnitudes within a stable range
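Most of these knobs can be set directly on a DL4J configuration. The following is only a minimal sketch, assuming DL4J's NeuralNetConfiguration builder API; the values are placeholders rather than recommendations:

    import org.deeplearning4j.nn.conf.GradientNormalization;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.weights.WeightInit;
    import org.nd4j.linalg.learning.config.AdaGrad;

    // Illustrative hyperparameter settings; the layer stack is omitted.
    NeuralNetConfiguration.Builder builder = new NeuralNetConfiguration.Builder()
        .updater(new AdaGrad(0.01))            // Adagrad: per-parameter learning rates
        .l1(1e-5).l2(1e-4)                     // L1 and L2 regularization
        .dropOut(0.5)                          // dropout (in DL4J this value is the retain probability)
        .weightInit(WeightInit.XAVIER)         // distribution used to generate the initial weights
        .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
        .gradientNormalizationThreshold(1.0);  // clip each gradient element to [-1, 1]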

Another important question is when to prefer a max pooling layer over a convolutional layer with the same stride. A max pooling layer has no parameters at all, whereas a convolutional layer adds quite a few weights to train. It can also help to add a local response normalization (LRN) layer, which makes the most strongly activated neurons inhibit the neurons at the same location in neighboring feature maps. This encourages different feature maps to specialize and pushes them apart, forcing them to explore a wider range of features. LRN is typically used in the lower layers so that the upper layers have a larger pool of low-level features to build upon.
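To make the trade-off concrete, here is a sketch assuming DL4J's layer builders (kernel sizes and channel counts are placeholders) that contrasts a parameter-free max pooling layer with a strided convolution, and shows where an LRN layer would sit:

    import org.deeplearning4j.nn.conf.layers.ConvolutionLayer;
    import org.deeplearning4j.nn.conf.layers.LocalResponseNormalization;
    import org.deeplearning4j.nn.conf.layers.SubsamplingLayer;
    import org.nd4j.linalg.activations.Activation;

    // A 2x2 max pooling layer with stride 2: downsamples the feature maps but has no parameters.
    SubsamplingLayer maxPool = new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
        .kernelSize(2, 2).stride(2, 2).build();

    // A convolution with the same stride also halves the spatial size,
    // but it adds kernelHeight * kernelWidth * nIn * nOut weights (plus biases) to train.
    ConvolutionLayer stridedConv = new ConvolutionLayer.Builder(2, 2)
        .stride(2, 2).nOut(64).activation(Activation.RELU).build();

    // Local response normalization, typically placed after a lower convolutional layer, so that
    // strongly activated neurons inhibit the same position in neighboring feature maps.
    LocalResponseNormalization lrn = new LocalResponseNormalization.Builder().build();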

One of the main problems observed when training large neural networks is overfitting, that is, producing very good approximations of the training data but noise in the regions between the training points. An overfitted model is adjusted so specifically to the training dataset that it cannot generalize. Therefore, although it performs well on the training set, its performance on the test dataset and on subsequent data is poor because it lacks the generalization property. Dropout counters this by randomly omitting a fraction of the neurons in each training iteration, as illustrated in the following figure:

Figure 14: Training with dropout versus without dropout

The main advantage of dropout is that it prevents all the neurons in a layer from synchronously optimizing their weights. This adaptation, made in random groups, keeps the neurons from converging on the same goal, thus de-correlating the adapted weights. A second property discovered when applying dropout is that the activations of the hidden units become sparse, which is also a desirable characteristic.
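Mechanically, dropout amounts to multiplying each hidden activation by an independent random mask during training. The following plain-Java sketch (illustrative only, not tied to any library) shows the inverted-dropout variant: kept units are rescaled so the expected activation is unchanged, and dropped units become exactly zero, which is where the sparsity comes from.

    import java.util.Arrays;
    import java.util.Random;

    public final class DropoutDemo {
        // Each unit is kept with probability keepProb and rescaled by 1/keepProb;
        // dropped units are zeroed, so the resulting activations are sparse.
        static double[] dropout(double[] activations, double keepProb, Random rng) {
            double[] out = new double[activations.length];
            for (int i = 0; i < activations.length; i++) {
                out[i] = (rng.nextDouble() < keepProb) ? activations[i] / keepProb : 0.0;
            }
            return out;
        }

        public static void main(String[] args) {
            double[] hidden = {0.7, 1.2, 0.1, 0.9, 0.3};
            System.out.println(Arrays.toString(dropout(hidden, 0.5, new Random(42))));
        }
    }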

Since the objective in a CNN is to minimize the evaluated cost function, we must define an optimizer. The following optimizers are supported by DL4J (a configuration sketch follows the list):

  • SGD (learning rate only)
  • Nesterovs momentum
  • Adagrad
  • RMSProp
  • Adam
  • AdaDelta

In most cases, if plain gradient descent does not give satisfactory performance, we can adopt RMSProp, a more advanced form of gradient descent. RMSProp performs better because it divides the learning rate by an exponentially decaying average of squared gradients. The suggested value for the decay parameter is 0.9, while a good default value for the learning rate is 0.001.

More technically, with the most common optimizer, Stochastic Gradient Descent (SGD), the learning rate must be scaled with 1/T to obtain convergence, where T is the number of iterations. RMSProp overcomes this limitation automatically by adjusting the step size so that it stays on the same scale as the gradients. So, if you are training a neural network on mini-batches, RMSProp is usually the faster way to learn. Researchers also recommend using a momentum-based optimizer, such as Nesterov's momentum, when training a deep CNN or DNN.
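The mechanics can be seen in a few lines of plain Java (a worked sketch, not DL4J internals): the running average of squared gradients decays with factor 0.9, and each step divides the learning rate of 0.001 by the square root of that average.

    public final class RmsPropStep {
        public static void main(String[] args) {
            double learningRate = 0.001;   // suggested default learning rate
            double decay = 0.9;            // suggested decay for the squared-gradient average
            double epsilon = 1e-8;         // avoids division by zero

            double weight = 0.5;           // a single illustrative parameter
            double meanSquaredGrad = 0.0;

            for (int step = 1; step <= 5; step++) {
                double gradient = 0.2;     // placeholder gradient for this parameter
                meanSquaredGrad = decay * meanSquaredGrad + (1 - decay) * gradient * gradient;
                weight -= learningRate * gradient / (Math.sqrt(meanSquaredGrad) + epsilon);
                System.out.printf("step %d: weight = %.6f%n", step, weight);
            }
        }
    }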

From the perspective of its layering architecture, a CNN is different from a DNN, and it has different requirements as well as tuning criteria. Another problem with CNNs is that the convolutional layers require a huge amount of RAM, especially during training, because the reverse pass of backpropagation needs all the intermediate values computed during the forward pass. During inference (that is, when making a prediction for a new instance), the RAM occupied by one layer can be released as soon as the next layer has been computed, so you only need as much RAM as is required by two consecutive layers.

However, during training, everything computed during the forward pass needs to be preserved for the reverse pass, so the amount of RAM needed is (at least) the total amount of RAM required by all layers. If your GPU runs out of memory while training a CNN, here are five things you could try to solve the problem, other than purchasing a GPU with more RAM (a rough memory estimate is sketched after the list):

  • Reduce the mini-batch size
  • Reduce dimensionality using a larger stride in one or more layers
  • Remove one or more layers
  • Use 16-bit floats instead of 32-bit floats
  • Distribute the CNN across multiple devices
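Before reaching for these fixes, it helps to estimate where the memory goes. The activations of a convolutional layer take roughly feature maps x map width x map height x bytes per value x mini-batch size; the plain-Java sketch below uses assumed layer dimensions purely for illustration:

    public final class ConvMemoryEstimate {
        public static void main(String[] args) {
            // Assumed example layer: 200 feature maps of 150 x 100 values, stored as 32-bit floats.
            long featureMaps = 200, mapWidth = 150, mapHeight = 100;
            long bytesPerValue = 4;        // drops to 2 with 16-bit floats
            long miniBatchSize = 100;

            long perInstance = featureMaps * mapWidth * mapHeight * bytesPerValue;  // 12 MB per instance
            long perBatch = perInstance * miniBatchSize;                            // 1.2 GB per mini-batch

            System.out.printf("Activations: %.1f MB per instance, %.2f GB per mini-batch%n",
                    perInstance / 1e6, perBatch / 1e9);
            // During training these activations must be kept for backpropagation,
            // and the totals add up across every layer in the network.
        }
    }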