The maximum length for padding sequences

So far, we have used a maximum length of 100 for padding sequences of movie reviews in the train and test data. Let's look at the summary of the length of movie reviews in the train and test data using the following code:

# Summary of review lengths in the training data
z <- NULL
for (i in 1:25000) {z[i] <- length(train_x[[i]])}
summary(z)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   11.0   130.0   178.0   238.7   291.0  2494.0

# Summary of review lengths in the test data
z <- NULL
for (i in 1:25000) {z[i] <- length(test_x[[i]])}
summary(z)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    7.0   128.0   174.0   230.8   280.0  2315.0
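As an aside, the same summaries can be computed without an explicit loop; a minimal sketch using sapply():

# Vectorized alternative to the loops above
summary(sapply(train_x, length))
summary(sapply(test_x, length))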

From the preceding output, we can make the following observations:

  • From the summary of the length of movie reviews in the train data, we can see that the minimum length is 11, the maximum length is 2,494, and that the median length is 178.
  • Similarly, the test data has a minimum review length of 7, a maximum length of 2,315, and a median length of 174.

Note that when the maximum padding length is below the median (as is the case with a maximum length of 100), we truncate more movie reviews by dropping words beyond the cutoff. Conversely, when we choose a maximum padding length significantly above the median, more reviews need to be filled with zeros and fewer reviews are truncated.
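To make this behavior concrete, here is a minimal sketch with two hypothetical toy sequences; by default, pad_sequences() drops words from the beginning of sequences longer than maxlen and adds leading zeros to shorter ones:

# Toy illustration of truncation and zero-padding (hypothetical sequences)
library(keras)
toy <- list(c(1, 2, 3, 4, 5, 6), c(7, 8))
pad_sequences(toy, maxlen = 4)
#      [,1] [,2] [,3] [,4]
# [1,]    3    4    5    6    <- truncated: the first two words are dropped
# [2,]    0    0    7    8    <- padded with leading zeros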

In this section, we are going to explore the impact of keeping the maximum length of the sequence of words in the movie reviews near the median value. The code for incorporating this change is as follows:

# IMDB data
c(c(train_x, train_y), c(test_x, test_y)) %<-% imdb
train_x <- pad_sequences(train_x, maxlen = 200)
test_x <- pad_sequences(test_x, maxlen = 200)

# Model architecture
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 500, output_dim = 32) %>%
  layer_simple_rnn(units = 32,
                   return_sequences = TRUE,
                   activation = 'relu') %>%
  layer_simple_rnn(units = 32,
                   return_sequences = TRUE,
                   activation = 'relu') %>%
  layer_simple_rnn(units = 32,
                   return_sequences = TRUE,
                   activation = 'relu') %>%
  layer_simple_rnn(units = 32,
                   activation = 'relu') %>%
  layer_dense(units = 1, activation = "sigmoid")

# Compile model
model %>% compile(optimizer = "rmsprop",
                  loss = "binary_crossentropy",
                  metrics = c("acc"))

# Fit model
model_five <- model %>% fit(train_x, train_y,
                            epochs = 10,
                            batch_size = 128,
                            validation_split = 0.2)

From the preceding code, we can see that we train the model after specifying maxlen as 200, keeping everything else the same as model_four.
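The training history returned by fit() can be visualized directly; a minimal sketch:

# Plot loss and accuracy for the training and validation data
plot(model_five)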

The plot for the loss and accuracy for the training and validation data is as follows:

From the preceding plot, we can make the following observations:

  • There is no sign of overfitting, since the loss and accuracy curves for the training and validation data stay very close to each other.
  • The loss and accuracy based on the test data were calculated as 0.383 and 0.830, respectively.
  • These are the best loss and accuracy values that we have obtained so far.
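The test loss and accuracy quoted above can be obtained with evaluate(); a minimal sketch:

# Loss and accuracy based on the test data
model %>% evaluate(test_x, test_y)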

The confusion matrix based on the test data is as follows:

# Prediction and confusion matrix
pred1 <- model %>% predict_classes(test_x)
table(Predicted = pred1, Actual = imdb$test$y)
         Actual
Predicted     0     1
        0 10066  1819
        1  2434 10681
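As a quick sanity check, the overall accuracy implied by this confusion matrix matches the value reported earlier:

# Correct classifications divided by total test reviews
(10066 + 10681) / 25000   # = 0.82988, in line with the reported 0.830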

From the confusion matrix, we can make the following observations:

  • This classification model performs slightly better at correctly classifying positive movie reviews (10,681) than at correctly classifying negative reviews (10,066).
  • Among the incorrectly classified reviews, the trend we observed earlier persists: negative reviews are mistakenly classified as positive (2,434) more often than positive reviews are classified as negative (1,819).

In this section, we experimented with the number of units, the activation functions, the number of recurrent layers in the network, and the amount of padding in order to improve the movie review sentiment classification model. Other factors that you could explore further include the number of most frequent words to include and the maximum length used when padding sequences; a sketch of one such experiment follows.
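For example, here is a minimal sketch of reloading the data with a larger vocabulary (a hypothetical num_words of 1,000), assuming the rest of the workflow stays as above; note that input_dim in layer_embedding must be changed to match num_words:

# Hypothetical follow-up: include the 1,000 most frequent words
imdb <- dataset_imdb(num_words = 1000)
c(c(train_x, train_y), c(test_x, test_y)) %<-% imdb
train_x <- pad_sequences(train_x, maxlen = 200)
test_x <- pad_sequences(test_x, maxlen = 200)
# layer_embedding(input_dim = 1000, ...) should then match num_words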
