Padding and truncation of sequences

When developing the author classification model, the sequence of integers for every training and test text needs to have the same length. We can achieve this by padding and truncating the sequences of integers, as follows:

# Padding and truncation
trainx <- pad_sequences(trainx, maxlen = 300)
testx <- pad_sequences(testx, maxlen = 300)
dim(trainx)
[1] 2500 300

Here, we set the maximum length of every sequence, maxlen, to 300. Any article with more than 300 integers is truncated to 300 integers, and any article with fewer than 300 integers is padded with zeros until it reaches 300. Note that both padding and truncation use the default setting of "pre", which is why it is not specifically indicated in the code.

This means that both truncation and padding act on the beginning of the sequence: integers are removed from, or zeros are added to, the start. To pad and/or truncate at the end of the sequence instead, we can make use of padding = "post" and/or truncating = "post" within the code. We can also see that the dimensions of trainx show a 2,500 x 300 matrix.
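To make the difference between "pre" and "post" concrete, the following is a minimal base R sketch of the behavior described above. The helper pad_one is hypothetical and is not part of keras; it only mimics, for a single sequence, what the text says pad_sequences does, and it is run here on toy data with maxlen = 5 rather than on the article data:

```r
# Hypothetical helper (not a keras function): pads or truncates ONE sequence
# to maxlen, mimicking the "pre"/"post" behavior described in the text
pad_one <- function(x, maxlen, padding = "pre", truncating = "pre", value = 0L) {
  n <- length(x)
  if (n > maxlen) {
    # "pre" drops integers from the start; "post" drops them from the end
    x <- if (truncating == "pre") x[(n - maxlen + 1):n] else x[seq_len(maxlen)]
  } else if (n < maxlen) {
    # "pre" adds the fill value at the start; "post" adds it at the end
    fill <- rep(value, maxlen - n)
    x <- if (padding == "pre") c(fill, x) else c(x, fill)
  }
  x
}

long_seq  <- 1:7        # toy sequence longer than maxlen
short_seq <- c(8L, 9L)  # toy sequence shorter than maxlen

pre_trunc  <- pad_one(long_seq, maxlen = 5)                       # 3 4 5 6 7
post_trunc <- pad_one(long_seq, maxlen = 5, truncating = "post")  # 1 2 3 4 5
pre_pad    <- pad_one(short_seq, maxlen = 5)                      # 0 0 0 8 9
post_pad   <- pad_one(short_seq, maxlen = 5, padding = "post")    # 8 9 0 0 0
```

With the default "pre" setting, the end of each article is always preserved, which matches the output shown for the train data in this section.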

Let's look at the output from text files 7 and 901 in the train data, as follows:

# Example of truncation
trainx[7,]
[1] 5 157 1 18 87 3 90 3 59 1 169 346 2 29 52 425
[17] 6 72 386 110 331 24 5 4 3 31 3 22 7 65 33 169
[33] 329 10 105 1 239 11 4 31 11 422 8 60 163 318 10 58
[49] 102 2 137 329 277 98 58 287 20 81 3 142 9 6 87 3
[65] 49 20 142 2 142 6 2 60 13 1 470 8 137 190 60 1
[81] 85 152 5 6 211 1 3 1 85 11 2 211 233 51 233 490
[97] 7 155 3 305 6 4 86 3 70 4 3 157 52 142 6 282
[113] 233 4 286 11 485 47 11 9 1 386 497 2 72 7 33 6
[129] 3 1 60 3 234 23 32 72 485 7 203 6 29 390 5 3
[145] 19 13 55 184 53 10 1 41 19 485 119 18 6 59 1 169
[161] 1 41 10 17 458 91 6 23 12 1 3 3 10 491 2 14
[177] 1 1 194 469 491 2 1 4 331 112 485 475 16 1 469 1
[193] 331 14 2 485 234 5 171 296 1 85 11 135 157 2 189 1
[209] 31 24 4 5 318 490 338 6 147 194 24 347 386 23 24 32
[225] 117 286 161 6 338 25 4 32 2 9 1 38 8 316 60 153
[241] 27 234 496 457 153 20 316 2 254 219 145 117 25 46 27 7
[257] 228 34 184 75 11 418 52 296 1 194 469 180 469 6 1 268
[273] 6 250 469 29 90 6 15 58 175 32 33 229 37 424 36 51
[289] 36 3 169 15 1 7 175 1 319 207 5 4

# Example of padding
trainx[901,]
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[17] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[33] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[49] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[65] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[81] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[97] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[113] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[129] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[145] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[161] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[177] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[193] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[209] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[225] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[241] 0 0 0 0 0 0 0 0 0 0 0 0 74 356 7 9
[257] 199 12 11 61 145 31 22 399 79 145 1 133 3 1 28 203
[273] 29 1 319 3 18 101 470 31 29 2 20 5 33 369 116 134
[289] 7 2 25 17 303 2 5 222 100 28 6 5

From the preceding output, we can make the following observations:

  • Text file 7, which had 314 integers, has been reduced to 300 integers. Note that this step removed 14 integers at the beginning of the sequence.
  • Text file 901, which had 48 integers, now has 300 integers. This has been achieved by adding 252 zeros at the beginning of the sequence to artificially bring the total number of integers to 300.
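The figures for text file 901 can be checked from the printed output alone. The sketch below rebuilds that row from the values shown above (the variable names tokens901 and row901 are introduced here for illustration only) and counts the padding. It assumes that 0 is never used as a word index, so every zero in the row is padding. Note that the same check cannot recover the original length of a truncated row such as text file 7, because the removed integers are no longer present.

```r
# Non-zero integers of trainx[901,], copied from the printed output above
tokens901 <- c(74, 356, 7, 9,
               199, 12, 11, 61, 145, 31, 22, 399, 79, 145, 1, 133, 3, 1, 28, 203,
               29, 1, 319, 3, 18, 101, 470, 31, 29, 2, 20, 5, 33, 369, 116, 134,
               7, 2, 25, 17, 303, 2, 5, 222, 100, 28, 6, 5)

# Rebuild the padded row: zeros first ("pre" padding), then the original integers
row901 <- c(rep(0, 300 - length(tokens901)), tokens901)

n_original <- sum(row901 != 0)  # integers that came from the article
n_padding  <- sum(row901 == 0)  # zeros added at the beginning
```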

Next, we will partition the training data into train and validation data, which will be required for training and assessing the network at the time of fitting the model.
