Padding sequences

Text sequences are padded so that every sequence has the same length. Let's take a look at the following code:

# Padding and truncation
train_x <- pad_sequences(train_x, maxlen = 100)
test_x <- pad_sequences(test_x, maxlen = 100)

From the preceding code, we can observe the following:

  • We can achieve equal length for all the sequences of integers with the help of the pad_sequences function and by specifying a value for maxlen.
  • In this example, we have restricted the length of each movie review sequence in the train and test data to 100. Note that before padding, train_x and test_x are each a list of 25,000 reviews, and the reviews differ in length.
  • After padding, both become matrices with 25,000 rows and 100 columns. This can be easily verified by running str(train_x) before and after padding.
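The change from a list to a matrix can be illustrated on a toy example in base R. This is only a minimal sketch: pad_pre is a hypothetical helper written for this illustration (it is not part of keras), and it assumes padding and truncation both happen at the beginning of a sequence, which matches the output shown later in this section.

```r
# Hypothetical helper (not a keras function): pads with zeros at the
# beginning and truncates from the beginning, returning a matrix.
pad_pre <- function(sequences, maxlen, value = 0L) {
  padded <- lapply(sequences, function(s) {
    s <- as.integer(s)
    n <- length(s)
    if (n >= maxlen) {
      # Too long (or exact): keep only the last maxlen integers
      s[(n - maxlen + 1):n]
    } else {
      # Too short: add zeros in front
      c(rep(value, maxlen - n), s)
    }
  })
  # Stack the equal-length vectors as rows of a matrix
  do.call(rbind, padded)
}

# Toy data: a list of three sequences with lengths 3, 6, and 4
toy <- list(c(5L, 8L, 2L), c(7L, 1L, 9L, 4L, 6L, 3L), c(2L, 2L, 5L, 1L))
str(toy)         # a list of 3 integer vectors

toy_padded <- pad_pre(toy, maxlen = 4)
str(toy_padded)  # an integer matrix with 3 rows and 4 columns
print(toy_padded)
```

The first toy sequence gains one leading zero, the second loses its first two integers, and the third is unchanged because it already has four integers.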

To observe the impact of padding on a sequence of integers, let's take a look at the following code, along with its output:

# Sequence of integers
train_x[3,]
[1] 2 4 2 33 89 78 12 66 16 4 360 7 4 58 316 334
[17] 11 4 2 43 2 2 8 257 85 2 42 2 2 83 68 2
[33] 15 36 165 2 278 36 69 2 2 8 106 14 2 2 18 6
[49] 22 12 215 28 2 40 6 87 326 23 2 21 23 22 12 272
[65] 40 57 31 11 4 22 47 6 2 51 9 170 23 2 116 2
[81] 2 13 191 79 2 89 2 14 9 8 106 2 2 35 2 6
[97] 227 7 129 113

train_x[6,]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[17] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[33] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[49] 0 0 0 0 0 0 0 0 0 1 2 128 74 12 2 163
[65] 15 4 2 2 2 2 32 85 156 45 40 148 139 121 2 2
[81] 10 10 2 173 4 2 2 16 2 8 4 226 65 12 43 127
[97] 24 2 10 10

The preceding output shows the third and sixth sequences of train_x after padding. Here, we can observe the following:

  • The third sequence now has a length of 100. It originally had 141 integers, and the 41 integers that were located at the beginning of the sequence have been truncated.
  • The sixth sequence shows a different pattern. It originally had a length of 43, so 57 zeros have been added to the beginning of the sequence to artificially extend its length to 100.
  • All 25,000 sequences of integers related to movie reviews in each of the train and test data are impacted in a similar way.
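The counts above follow from simple arithmetic on the original lengths. The following base R sketch reproduces them; the vector original_lengths is typed in by hand from the two lengths quoted in the text rather than computed from the data.

```r
maxlen <- 100

# Original lengths of the third and sixth reviews, as quoted in the text
original_lengths <- c(third = 141, sixth = 43)

# Integers dropped from the beginning of sequences longer than maxlen
truncated <- pmax(original_lengths - maxlen, 0)

# Zeros added to the beginning of sequences shorter than maxlen
padded <- pmax(maxlen - original_lengths, 0)

print(truncated)
print(padded)
```

A sequence is either truncated or padded, never both, so for each review one of the two counts is zero.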

In the next section, we will develop an architecture for a recurrent neural network that will be used for developing a movie review sentiment classification model.
