Experimenting with a bidirectional LSTM layer

A bidirectional LSTM, as the name indicates, processes the input sequence of integers in both directions: one LSTM reads the sequence in its original order, while a second reads it in reverse, and their outputs are combined. In some situations, this approach can further improve classification performance by capturing useful patterns in the data that a unidirectional LSTM network misses.

For this experiment, we will modify the LSTM layer in the first experiment, as shown in the following code:

# Model architecture
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 500, output_dim = 32) %>%
  bidirectional(layer_lstm(units = 32)) %>%
  layer_dense(units = 1, activation = "sigmoid")

# Model summary
summary(model)
Model
____________________________________________________________________
Layer (type)                      Output Shape             Param #
====================================================================
embedding_8 (Embedding)           (None, None, 32)         16000
____________________________________________________________________
bidirectional_5 (Bidirectional)   (None, 64)               16640
____________________________________________________________________
dense_11 (Dense)                  (None, 1)                65
====================================================================
Total params: 32,705
Trainable params: 32,705
Non-trainable params: 0
____________________________________________________________________

From the preceding code output, we can make the following observations:

  • We converted the LSTM layer into a bidirectional LSTM layer using the bidirectional() function.
  • This change doubles the number of parameters related to the LSTM layer to 16,640, as can be seen from the model summary.
  • The total number of parameters for this architecture now increases to 32,705. This increase in the number of parameters further slows down training. A quick check of where these counts come from follows this list.
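
To verify these counts, we can reproduce them by hand. The standard Keras parameter formula for an LSTM layer is 4 * ((input_dim + units + 1) * units), covering the four gates with their input weights, recurrent weights, and biases; the bidirectional wrapper doubles this. A minimal sketch in R:

# Reproducing the parameter counts by hand
embed_params <- 500 * 32                    # 500 words x 32 dimensions = 16,000
lstm_params  <- 4 * ((32 + 32 + 1) * 32)    # 4 gates x (input + recurrent + bias) x units = 8,320
bidir_params <- 2 * lstm_params             # forward + backward LSTM = 16,640
dense_params <- 2 * 32 + 1                  # 64 concatenated outputs + 1 bias = 65
embed_params + bidir_params + dense_params  # 32,705, matching summary(model)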

Here is a simple flow chart for the bidirectional LSTM network architecture:

The flow chart for the bidirectional LSTM network shows the embedding, bidirectional, and dense layers. In the bidirectional LSTM layer, tanh is used as the activation function, and the dense layer uses the sigmoid activation function.

The code for compiling and training the model is as follows:

# Compiling model
model %>% compile(optimizer = "adam",
                  loss = "binary_crossentropy",
                  metrics = c("acc"))

# Fitting model
model_four <- model %>% fit(train_x, train_y,
                            epochs = 10,
                            batch_size = 128,
                            validation_split = 0.2)

# Loss and accuracy plot
plot(model_four)

As seen from the preceding code, we continue to use the adam optimizer and keep the other settings the same as before when compiling and then fitting the model.

After we train the model, the accuracy and loss values for each epoch are stored in model_four; plot(model_four) uses these values to produce the loss and accuracy plot.
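
If we want to inspect the stored values directly rather than only plotting them, the training history object keeps them in its metrics list. A minimal sketch, assuming the metric names match those passed to compile() (acc here):

# Inspect the per-epoch values stored in the training history
str(model_four$metrics)       # named list: loss, acc, val_loss, and val_acc
model_four$metrics$val_acc    # validation accuracy for each of the 10 epochs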

From the resulting plot, we can make the following observations:

  • The loss and accuracy plot doesn't show any cause for concern regarding overfitting, as the lines for training and validation stay reasonably close to each other.
  • The plot also shows that we do not need more than ten epochs to train this model (see the early-stopping sketch after this list).
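
As an optional refinement not used in the original experiment, training can be stopped automatically once the validation loss stops improving by passing an early-stopping callback to fit(). A minimal sketch:

# Optional: stop training early when validation loss stops improving
# (a sketch; the experiment above simply uses a fixed 10 epochs)
model_four <- model %>% fit(train_x, train_y,
                            epochs = 10,
                            batch_size = 128,
                            validation_split = 0.2,
                            callbacks = list(
                              callback_early_stopping(monitor = "val_loss",
                                                      patience = 2)))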

We will obtain the loss, accuracy, and confusion matrix for the training data using the following code:

# Loss and accuracy
model %>% evaluate(train_x, train_y)
$loss
[1] 0.3410529

$acc
[1] 0.85232

pred <- model %>% predict_classes(train_x)

# Confusion matrix
table(Predicted = pred, Actual = imdb$train$y)
         Actual
Predicted     0     1
        0 10597  1789
        1  1903 10711

From the preceding code output, we can make the following observations:

  • For the training data, we obtain loss and accuracy values of 0.341 and 0.852, respectively. These results are only marginally below the previous results and not significantly different.
  • This time, the confusion matrix shows a more even performance in correctly classifying positive and negative movie reviews.
  • For negative movie reviews, the correct classification rate is about 84.8%, and for positive reviews it is about 85.7% (computed from the confusion matrix, as shown after this list).
  • This difference of about 1% is much smaller than what we observed for the earlier models.
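
The per-class rates quoted above can be read off the confusion matrix by normalizing its columns, as in this short sketch:

# Correct classification rate per class: normalize each column of the
# confusion matrix by its column total
cm <- table(Predicted = pred, Actual = imdb$train$y)
diag(prop.table(cm, margin = 2))   # ~0.848 for negative (0), ~0.857 for positive (1)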

We will now repeat the preceding process with the test data. Following is the code for obtaining the loss, accuracy, and confusion matrix:

# Loss and accuracy
model %>% evaluate(test_x, test_y)
$loss
[1] 0.3737377

$acc
[1] 0.83448

pred1 <- model %>% predict_classes(test_x)

# Confusion matrix
table(Predicted = pred1, Actual = imdb$test$y)
         Actual
Predicted     0     1
        0 10344  1982
        1  2156 10518

From the preceding code output, we can make the following observations:

  • For the test data, the loss and accuracy values are 0.374 and 0.834 respectively.
  • The confusion matrix shows that the negative reviews are correctly classified by the model at a rate of about 82.8%.
  • This model correctly classifies positive movie reviews at a rate of about 84.1%.
  • These results are consistent with those obtained for the training data.

The experiment with the bidirectional LSTM yielded loss and accuracy performance comparable to that obtained with two LSTM layers in the previous experiment. The main gain, however, is consistency: this model correctly classifies negative and positive movie reviews at much more similar rates.

In this chapter, we used the LSTM network to develop a movie review sentiment classification model. When data involves sequences, LSTM networks help to capture long-term dependencies in the sequence of words or integers. We experimented with four different LSTM models by making changes to the architecture and training settings; the results are summarized in the following table.

This table summarizes the performance of the four LSTM models:

Model  LSTM Layers    Optimizer  Data   Loss   Accuracy  Accuracy for        Accuracy for
                                                         Negative Reviews    Positive Reviews
                                                         (Specificity)       (Sensitivity)
One    1              rmsprop    Train  0.375  82.8%     74.1%               91.4%
                                 Test   0.399  81.9%     73.3%               90.7%
Two    1              adam       Train  0.360  84.3%     88.9%               79.7%
                                 Test   0.385  82.9%     86.9%               78.8%
Three  2              adam       Train  0.339  85.5%     90.0%               81.0%
                                 Test   0.376  83.7%     87.3%               80.0%
Four   Bidirectional  adam       Train  0.341  85.2%     84.8%               85.7%
                                 Test   0.374  83.4%     82.8%               84.1%

We can make the following observations from the preceding table:

  • Out of the four models that we tried, the bidirectional LSTM model provided the best overall performance; it has the lowest loss value on the test data.
  • Although overall accuracy is slightly lower for the fourth model than for the third, its accuracy in correctly classifying negative and positive reviews is much more consistent, varying from 82.8% to 84.1%, a spread of only about 1.3%.
  • The third model seems biased toward negative reviews, correctly classifying them at a rate of 87.3% on the test data, while correctly classifying positive reviews at a rate of only 80%. Hence, the spread between the correct classification of negative and positive reviews for the third model is more than 7%.
  • The spread between sensitivity and specificity is even higher for the first two models.

Although the fourth model provides good results, additional improvements can certainly be explored by experimenting further with other variables. Candidates for further experiments include the number of most frequent words retained, the use of pre- versus post-padding and/or truncation, the maximum length used for padding, the number of units in the LSTM layer, and the choice of optimizer when compiling the model, as sketched below.
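
As an illustration only, and with values that are arbitrary guesses rather than tested settings, here is how some of those variables could be changed in the existing pipeline:

# Illustrative variations only; these particular values are untested
imdb    <- dataset_imdb(num_words = 1000)      # keep more frequent words (was 500)
train_x <- pad_sequences(imdb$train$x,
                         maxlen = 200,         # example maximum length
                         padding = "post",     # post- instead of pre-padding
                         truncating = "post")  # post- instead of pre-truncation

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 1000, output_dim = 32) %>%
  bidirectional(layer_lstm(units = 64)) %>%    # more units in the LSTM layer
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(optimizer = "rmsprop",       # an alternative optimizer
                  loss = "binary_crossentropy",
                  metrics = c("acc"))

Note that the test data would need the same padding and truncation settings applied before evaluating any such variant.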
