Tuning and optimizing CNN hyperparameters

The following hyperparameters are particularly important and need to be tuned to achieve good results (a configuration sketch follows the list).

  • Dropout: Randomly omits feature detectors during training to prevent overfitting
  • Sparsity: Encourages sparse activations, which helps when the inputs themselves are sparse or rare
  • Adagrad: Adapts the learning rate per parameter, optimizing it for each feature
  • Regularization: L1 and L2 regularization of the weights
  • Weight transforms: Useful for deep autoencoders
  • Probability distribution manipulation: Controls the distribution from which the initial weights are drawn
  • Gradient normalization and clipping: Keep gradient magnitudes within a stable range
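Most of these knobs can be set directly on a DL4J configuration. The following is only a minimal sketch, assuming DL4J's NeuralNetConfiguration builder API; the values are placeholders rather than recommendations:

    import org.deeplearning4j.nn.conf.GradientNormalization;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.weights.WeightInit;
    import org.nd4j.linalg.learning.config.AdaGrad;

    // Illustrative hyperparameter settings; the layer stack is omitted.
    NeuralNetConfiguration.Builder builder = new NeuralNetConfiguration.Builder()
        .updater(new AdaGrad(0.01))            // Adagrad: per-parameter learning rates
        .l1(1e-5).l2(1e-4)                     // L1 and L2 regularization
        .dropOut(0.5)                          // dropout (in DL4J this value is the retain probability)
        .weightInit(WeightInit.XAVIER)         // distribution used to generate the initial weights
        .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
        .gradientNormalizationThreshold(1.0);  // clip each gradient element to [-1, 1]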

Another important question is when to prefer a max pooling layer over a convolutional layer with the same stride. A max pooling layer has no parameters at all, whereas a convolutional layer adds quite a few weights to train. It can also help to add a local response normalization (LRN) layer, which makes the most strongly activated neurons inhibit the neurons at the same location in neighboring feature maps. This encourages different feature maps to specialize and pushes them apart, forcing them to explore a wider range of features. LRN is typically used in the lower layers so that the upper layers have a larger pool of low-level features to build upon.
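To make the trade-off concrete, here is a sketch assuming DL4J's layer builders (kernel sizes and channel counts are placeholders) that contrasts a parameter-free max pooling layer with a strided convolution, and shows where an LRN layer would sit:

    import org.deeplearning4j.nn.conf.layers.ConvolutionLayer;
    import org.deeplearning4j.nn.conf.layers.LocalResponseNormalization;
    import org.deeplearning4j.nn.conf.layers.SubsamplingLayer;
    import org.nd4j.linalg.activations.Activation;

    // A 2x2 max pooling layer with stride 2: downsamples the feature maps but has no parameters.
    SubsamplingLayer maxPool = new SubsamplingLayer.Builder(SubsamplingLayer.PoolingType.MAX)
        .kernelSize(2, 2).stride(2, 2).build();

    // A convolution with the same stride also halves the spatial size,
    // but it adds kernelHeight * kernelWidth * nIn * nOut weights (plus biases) to train.
    ConvolutionLayer stridedConv = new ConvolutionLayer.Builder(2, 2)
        .stride(2, 2).nOut(64).activation(Activation.RELU).build();

    // Local response normalization, typically placed after a lower convolutional layer, so that
    // strongly activated neurons inhibit the same position in neighboring feature maps.
    LocalResponseNormalization lrn = new LocalResponseNormalization.Builder().build();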

One of the main problems observed when training large neural networks is overfitting, that is, producing very good approximations of the training data but noise in the regions between the training points. An overfitted model is adjusted so specifically to the training dataset that it cannot generalize. Therefore, although it performs well on the training set, its performance on the test dataset and on subsequent data is poor because it lacks the generalization property. Dropout counters this by randomly omitting a fraction of the neurons in each training iteration, as illustrated in the following figure:

Figure 14: Training with dropout versus without dropout

The main advantage of dropout is that it prevents all the neurons in a layer from synchronously optimizing their weights. This adaptation, made in random groups, keeps the neurons from converging on the same goal, thus de-correlating the adapted weights. A second property discovered when applying dropout is that the activations of the hidden units become sparse, which is also a desirable characteristic.
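Mechanically, dropout amounts to multiplying each hidden activation by an independent random mask during training. The following plain-Java sketch (illustrative only, not tied to any library) shows the inverted-dropout variant: kept units are rescaled so the expected activation is unchanged, and dropped units become exactly zero, which is where the sparsity comes from.

    import java.util.Arrays;
    import java.util.Random;

    public final class DropoutDemo {
        // Each unit is kept with probability keepProb and rescaled by 1/keepProb;
        // dropped units are zeroed, so the resulting activations are sparse.
        static double[] dropout(double[] activations, double keepProb, Random rng) {
            double[] out = new double[activations.length];
            for (int i = 0; i < activations.length; i++) {
                out[i] = (rng.nextDouble() < keepProb) ? activations[i] / keepProb : 0.0;
            }
            return out;
        }

        public static void main(String[] args) {
            double[] hidden = {0.7, 1.2, 0.1, 0.9, 0.3};
            System.out.println(Arrays.toString(dropout(hidden, 0.5, new Random(42))));
        }
    }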

Since the objective in a CNN is to minimize the evaluated cost function, we must define an optimizer. The following optimizers are supported by DL4J (a configuration sketch follows the list):

  • SGD (learning rate only)
  • Nesterovs momentum
  • Adagrad
  • RMSProp
  • Adam
  • AdaDelta

In most cases, if plain gradient descent does not give satisfactory performance, we can adopt RMSProp, a more advanced form of gradient descent. RMSProp performs better because it divides the learning rate by an exponentially decaying average of squared gradients. The suggested value for the decay parameter is 0.9, while a good default value for the learning rate is 0.001.

More technically, with the most common optimizer, Stochastic Gradient Descent (SGD), the learning rate must be scaled with 1/T to obtain convergence, where T is the number of iterations. RMSProp overcomes this limitation automatically by adjusting the step size so that it stays on the same scale as the gradients. So, if you are training a neural network on mini-batches, RMSProp is usually the faster way to learn. Researchers also recommend using a momentum-based optimizer, such as Nesterov's momentum, when training a deep CNN or DNN.
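The mechanics can be seen in a few lines of plain Java (a worked sketch, not DL4J internals): the running average of squared gradients decays with factor 0.9, and each step divides the learning rate of 0.001 by the square root of that average.

    public final class RmsPropStep {
        public static void main(String[] args) {
            double learningRate = 0.001;   // suggested default learning rate
            double decay = 0.9;            // suggested decay for the squared-gradient average
            double epsilon = 1e-8;         // avoids division by zero

            double weight = 0.5;           // a single illustrative parameter
            double meanSquaredGrad = 0.0;

            for (int step = 1; step <= 5; step++) {
                double gradient = 0.2;     // placeholder gradient for this parameter
                meanSquaredGrad = decay * meanSquaredGrad + (1 - decay) * gradient * gradient;
                weight -= learningRate * gradient / (Math.sqrt(meanSquaredGrad) + epsilon);
                System.out.printf("step %d: weight = %.6f%n", step, weight);
            }
        }
    }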

From the perspective of its layering architecture, a CNN is different from a DNN, and it has different requirements as well as tuning criteria. Another problem with CNNs is that the convolutional layers require a huge amount of RAM, especially during training, because the reverse pass of backpropagation needs all the intermediate values computed during the forward pass. During inference (that is, when making a prediction for a new instance), the RAM occupied by one layer can be released as soon as the next layer has been computed, so you only need as much RAM as is required by two consecutive layers.

However, during training, everything computed during the forward pass needs to be preserved for the reverse pass, so the amount of RAM needed is (at least) the total amount of RAM required by all layers. If your GPU runs out of memory while training a CNN, here are five things you could try to solve the problem, other than purchasing a GPU with more RAM (a rough memory estimate is sketched after the list):

  • Reduce the mini-batch size
  • Reduce dimensionality using a larger stride in one or more layers
  • Remove one or more layers
  • Use 16-bit floats instead of 32-bit floats
  • Distribute the CNN across multiple devices
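Before reaching for these fixes, it helps to estimate where the memory goes. The activations of a convolutional layer take roughly feature maps x map width x map height x bytes per value x mini-batch size; the plain-Java sketch below uses assumed layer dimensions purely for illustration:

    public final class ConvMemoryEstimate {
        public static void main(String[] args) {
            // Assumed example layer: 200 feature maps of 150 x 100 values, stored as 32-bit floats.
            long featureMaps = 200, mapWidth = 150, mapHeight = 100;
            long bytesPerValue = 4;        // drops to 2 with 16-bit floats
            long miniBatchSize = 100;

            long perInstance = featureMaps * mapWidth * mapHeight * bytesPerValue;  // 12 MB per instance
            long perBatch = perInstance * miniBatchSize;                            // 1.2 GB per mini-batch

            System.out.printf("Activations: %.1f MB per instance, %.2f GB per mini-batch%n",
                    perInstance / 1e6, perBatch / 1e9);
            // During training these activations must be kept for backpropagation,
            // and the totals add up across every layer in the network.
        }
    }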