Tuning LSTM hyperparameters and GRU

Nevertheless, I still believe it is possible to attain close to 100% accuracy with more LSTM layers. The following are the hyperparameters I would still try to tune to see how the accuracy changes:

// Hyperparameters for the LSTM training
val learningRate = 0.001f
val trainingIters = trainingDataCount * 1000 // Loop 1,000 times over the dataset
val batchSize = 1500 // I would also try 5000 and compare the performance
val displayIter = 15000 // Show the test set accuracy during training
val numLstmLayer = 3 // Also try 5, 7, 9, and so on
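
As a rough sketch of how one might explore these settings (this is not the book's actual training code), the loop below grid-searches over the number of LSTM layers and the batch size. The trainAndEvaluate function is a hypothetical placeholder that is assumed to build the network with the given hyperparameters, train it, and return the test set accuracy:

// Hypothetical helper: replace the dummy body with the real training and evaluation routine
def trainAndEvaluate(numLstmLayer: Int, batchSize: Int, learningRate: Float): Float = {
  // Dummy placeholder value so the sketch runs end to end
  scala.util.Random.nextFloat()
}

val layerCandidates = Seq(3, 5, 7, 9)
val batchCandidates = Seq(1500, 5000)

// Try every combination and record the resulting test accuracy
val results = for {
  layers <- layerCandidates
  batch  <- batchCandidates
} yield (layers, batch, trainAndEvaluate(layers, batch, learningRate = 0.001f))

results.foreach { case (layers, batch, acc) =>
  println(s"numLstmLayer=$layers, batchSize=$batch => accuracy=$acc")
}

// Keep the configuration with the best test accuracy
val (bestLayers, bestBatch, bestAcc) = results.maxBy(_._3)
println(s"Best: numLstmLayer=$bestLayers, batchSize=$bestBatch, accuracy=$bestAcc")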

There are many other variants of the LSTM cell. One particularly popular variant is the Gated Recurrent Unit (GRU) cell, which is a simplified variation of the LSTM. It merges the cell state and the hidden state into a single state and makes a few other changes. The resulting model is simpler than a standard LSTM and has been growing increasingly popular. The GRU cell was proposed by Kyunghyun Cho et al. in a 2014 paper that also introduced the encoder-decoder network we mentioned earlier.

For more on the GRU and other LSTM variants, interested readers can refer to the following publications:
  • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, K. Cho et al. (2014).
  • LSTM: A Search Space Odyssey, Klaus Greff et al. (2015), which suggests that all of the LSTM variants perform roughly the same.

Technically, a GRU cell is a simplified version of an LSTM cell in which both state vectors are merged into a single vector, h(t). A single gate controller drives both the forget gate and the input gate. If the gate controller outputs a 1, the input gate is open and the forget gate is closed:

Figure 18: Internal structure of a GRU cell

On the other hand, if it outputs a 0, the opposite happens. In other words, whenever a memory must be stored, the location where it will be stored is erased first; this is itself a frequent variant of the LSTM cell. The second simplification is that, since the full state vector is output at every time step, there is no output gate. However, a new gate controller is introduced that controls which part of the previous state is shown to the main layer. The following equations summarize how a GRU cell's state (which is also its output) is computed at each time step for a single instance:
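
In the usual notation, x(t) is the input vector, h(t) is the merged state vector, z(t) is the shared forget/input gate controller, r(t) is the controller that filters the previous state, \sigma is the logistic function, and \otimes denotes element-wise multiplication; the W matrices and b vectors are the cell's weights and biases:

\begin{aligned}
\mathbf{z}_{(t)} &= \sigma\left(\mathbf{W}_{xz}^{\top}\,\mathbf{x}_{(t)} + \mathbf{W}_{hz}^{\top}\,\mathbf{h}_{(t-1)} + \mathbf{b}_z\right) \\
\mathbf{r}_{(t)} &= \sigma\left(\mathbf{W}_{xr}^{\top}\,\mathbf{x}_{(t)} + \mathbf{W}_{hr}^{\top}\,\mathbf{h}_{(t-1)} + \mathbf{b}_r\right) \\
\mathbf{g}_{(t)} &= \tanh\left(\mathbf{W}_{xg}^{\top}\,\mathbf{x}_{(t)} + \mathbf{W}_{hg}^{\top}\left(\mathbf{r}_{(t)} \otimes \mathbf{h}_{(t-1)}\right) + \mathbf{b}_g\right) \\
\mathbf{h}_{(t)} &= \left(1 - \mathbf{z}_{(t)}\right) \otimes \mathbf{h}_{(t-1)} + \mathbf{z}_{(t)} \otimes \mathbf{g}_{(t)}
\end{aligned}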

LSTM and GRU cells are among the main reasons for the success of RNNs in recent years, in particular for applications in NLP.
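
To make the preceding equations concrete, here is a minimal, self-contained Scala sketch (not the book's project code) of a single GRU step on plain arrays; the weight matrices and bias vectors are assumed to be given:

import scala.math.{exp, tanh}

object GruStepSketch {
  type Vec = Array[Float]
  type Mat = Array[Array[Float]] // rows x cols

  private def sigmoid(x: Float): Float = (1.0 / (1.0 + exp(-x))).toFloat

  private def matVec(w: Mat, x: Vec): Vec =
    w.map(row => row.zip(x).map { case (a, b) => a * b }.sum)

  private def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (p, q) => p + q }
  private def hadamard(a: Vec, b: Vec): Vec = a.zip(b).map { case (p, q) => p * q }

  /** One GRU time step: returns the new merged state h(t). */
  def gruStep(x: Vec, hPrev: Vec,
              wxz: Mat, whz: Mat, bz: Vec,
              wxr: Mat, whr: Mat, br: Vec,
              wxg: Mat, whg: Mat, bg: Vec): Vec = {
    // z(t): single controller shared by the forget and input gates
    val z = add(add(matVec(wxz, x), matVec(whz, hPrev)), bz).map(sigmoid)
    // r(t): controls which part of the previous state is shown to the main layer
    val r = add(add(matVec(wxr, x), matVec(whr, hPrev)), br).map(sigmoid)
    // g(t): the main layer's candidate state
    val g = add(add(matVec(wxg, x), matVec(whg, hadamard(r, hPrev))), bg).map(v => tanh(v).toFloat)
    // h(t) = (1 - z) * h(t-1) + z * g : erase where a new memory is stored, keep the rest
    hPrev.zip(z).zip(g).map { case ((hp, zi), gi) => (1 - zi) * hp + zi * gi }
  }
}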
