There's more...

The learning rate is one of the factors that determine how efficiently a neural network trains. If the learning rate is too high, the weight updates overshoot and training can diverge instead of converging on the expected output; if it is too low, convergence is slow and so is learning. Efficiency also depends on the initial weights that we assign to the neurons in every layer, so a suitable weight initialization (for example, drawing the initial weights from a uniform distribution) might help during the early stages of training.
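The effect of the learning rate can be seen on a toy problem. The following is a minimal sketch, not taken from the recipe: it runs plain gradient descent on f(w) = w^2, and the class name, method name, and chosen rates are all illustrative assumptions.

```java
class LearningRateDemo {

    // Minimizes f(w) = w^2 with plain gradient descent and returns the final w.
    // The gradient of w^2 is 2w, so each step multiplies w by (1 - 2 * learningRate).
    public static double descend(double learningRate, double start, int steps) {
        double w = start;
        for (int i = 0; i < steps; i++) {
            double gradient = 2.0 * w;
            w -= learningRate * gradient;
        }
        return w;
    }

    public static void main(String[] args) {
        double start = 1.0;
        int steps = 20;
        // Too high: |1 - 2 * 1.1| > 1, so w grows in magnitude at every step.
        System.out.println("lr=1.1   -> w=" + descend(1.1, start, steps));
        // Too low: w shrinks by only 0.2% per step, so it barely moves.
        System.out.println("lr=0.001 -> w=" + descend(0.001, start, steps));
        // Moderate: w shrinks by 20% per step and approaches the minimum at 0.
        System.out.println("lr=0.1   -> w=" + descend(0.1, start, steps));
    }
}
```

A real network has a far more complicated loss surface, so the rates that diverge or stall will differ, but the same pattern is what we look for when tuning.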

The most commonly followed approach to reducing overfitting is to add dropout to the layers. This forces the neural network to ignore a random subset of neurons during training, which discourages it from memorizing the training data. How do we find out whether a network has memorized the results? We just need to expose the network to new data. If the accuracy metrics are noticeably worse on the new data than on the training data, then you've got a case of overfitting.
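To make the idea of ignoring neurons concrete, here is a minimal sketch of dropout applied to one layer's activations. It assumes the common "inverted dropout" formulation, in which the kept activations are scaled up so that their expected value is unchanged; the class name, method name, and the retainProb parameter are hypothetical and are not part of any library.

```java
import java.util.Arrays;
import java.util.Random;

class DropoutSketch {

    // Applies inverted dropout: each activation is kept with probability
    // retainProb and scaled by 1 / retainProb; otherwise it is set to zero.
    public static double[] apply(double[] activations, double retainProb, Random rng) {
        if (retainProb <= 0.0 || retainProb > 1.0) {
            throw new IllegalArgumentException("retainProb must be in (0, 1]");
        }
        double[] out = new double[activations.length];
        for (int i = 0; i < activations.length; i++) {
            boolean keep = rng.nextDouble() < retainProb;
            out[i] = keep ? activations[i] / retainProb : 0.0;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] activations = new double[10];
        Arrays.fill(activations, 1.0);
        // With retainProb = 0.9, roughly one activation in ten is zeroed on average.
        double[] dropped = apply(activations, 0.9, new Random(42));
        System.out.println(Arrays.toString(dropped));
    }
}
```

Dropout is applied only during training. At evaluation time, all neurons are used, which is why the scaling during training matters.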

Another possibility for increasing the efficiency of the neural network (and thus reducing overfitting) is to try L1/L2 regularization in the network layers. Adding L1/L2 regularization to a layer adds an extra penalty term to the error function. L1 penalizes the sum of the absolute values of the weights, while L2 penalizes the sum of the squares of the weights. L2 regularization tends to give better predictions when the output variable is a function of all the input features. L1 regularization is preferred when the dataset has outliers or when not all of the attributes contribute to predicting the output variable, because it tends to drive the weights of unhelpful features to zero. In most cases, the major cause of overfitting is memorization. Keep in mind that dropout can be overdone: if we drop too many neurons, the network will eventually underfit the data, because we discard more useful information than we can afford to.
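The two penalty terms described above can be sketched as follows. This is an illustrative calculation only: the class name, the lambda coefficient, and the sample weights and loss value are assumptions, and some libraries define the L2 term with an extra factor of 0.5.

```java
class RegularizationPenalty {

    // L1 penalty: lambda times the sum of the absolute values of the weights.
    public static double l1(double[] weights, double lambda) {
        double sum = 0.0;
        for (double w : weights) {
            sum += Math.abs(w);
        }
        return lambda * sum;
    }

    // L2 penalty: lambda times the sum of the squares of the weights.
    public static double l2(double[] weights, double lambda) {
        double sum = 0.0;
        for (double w : weights) {
            sum += w * w;
        }
        return lambda * sum;
    }

    public static void main(String[] args) {
        double[] weights = {0.5, -2.0, 0.0, 1.5}; // hypothetical layer weights
        double lambda = 0.01;                     // hypothetical regularization strength
        double dataLoss = 0.30;                   // hypothetical unregularized error
        // The penalty is added to the error that the optimizer minimizes.
        System.out.println("loss + L1 = " + (dataLoss + l1(weights, lambda)));
        System.out.println("loss + L2 = " + (dataLoss + l2(weights, lambda)));
    }
}
```

Note how the weight of -2.0 contributes 2.0 to the L1 sum but 4.0 to the L2 sum: L2 punishes large weights more heavily, while L1 applies the same pressure to every nonzero weight.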

Note that the trade-off can vary depending on the kind of problem. Accuracy alone cannot guarantee good model performance every time. It is good to measure precision if we cannot afford the cost of a false positive prediction (such as in spam email detection, where a legitimate email would be flagged as spam). It is good to measure recall if we cannot afford the cost of a false negative prediction (such as in fraudulent transaction detection, where a fraudulent transaction would go unnoticed). The F1 score is a good choice if there's an uneven distribution of the classes in the dataset. ROC curves are good to measure when there are approximately equal numbers of observations for each output class.
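The metrics above can all be derived from the four counts of a binary confusion matrix. The following is a small sketch using the standard definitions; the class name and the example counts are made up for illustration.

```java
class ClassificationMetrics {

    // Fraction of all predictions that were correct.
    public static double accuracy(int tp, int tn, int fp, int fn) {
        int total = tp + tn + fp + fn;
        return total == 0 ? 0.0 : (double) (tp + tn) / total;
    }

    // Of everything predicted positive, how much really was positive.
    public static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // Of everything that really was positive, how much we found.
    public static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // Harmonic mean of precision and recall.
    public static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Hypothetical imbalanced dataset: 990 negatives and 10 positives,
        // scored by a model that predicts "negative" for every sample.
        int tp = 0, tn = 990, fp = 0, fn = 10;
        System.out.println("accuracy = " + accuracy(tp, tn, fp, fn));
        System.out.println("recall   = " + recall(tp, fn));
        System.out.println("f1       = " + f1(tp, fp, fn));
    }
}
```

In this made-up case the accuracy is 99% even though the model never finds a single positive sample, which is why recall and the F1 score are more informative when the classes are unevenly distributed.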

Once the evaluation results are stable, we can look at ways to optimize the efficiency of the neural network. There are multiple methods to choose from. We can perform several training sessions to find the optimal number of hidden layers, number of epochs, dropout rate, and activation functions.

The following screenshot points to various hyperparameters that can influence neural network efficiency:

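Note that dropOut(0.9) means we ignore 10% of neurons during training: the value passed to dropOut() is the probability of retaining a neuron, not the probability of dropping it.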

Other attributes/methods in the screenshot are the following:

  • weightInit(): This specifies how the initial weights are assigned to the neurons at each layer.
  • updater(): This specifies the gradient updater configuration. Adam is a gradient update algorithm.

In Chapter 12, Benchmarking and Neural Network Optimization, we will walk through an example of hyperparameter optimization that automatically finds the optimal parameters for you. It simply performs multiple training sessions on our behalf to find the optimal values in a single program execution. You may also refer to Chapter 12, Benchmarking and Neural Network Optimization, if you're interested in applying benchmarks to the application.
