Padding and truncation of sequences

When developing the author classification model, the sequences of integers for the training and test texts all need to be of equal length. We can achieve this by padding and truncating the sequences of integers, as follows:

# Padding and truncation
trainx <- pad_sequences(trainx, maxlen = 300) 
testx <- pad_sequences(testx, maxlen = 300)
dim(trainx) 
[1] 2500  300

Here, we set the maximum length of every sequence, maxlen, to 300. Any sequence that is longer than 300 integers is truncated, and any sequence that is shorter than 300 integers is padded with zeros. Note that both padding and truncation use the default setting of "pre", which is why it is not specified in the code.

With "pre", padding and truncation both act on the beginning of the sequence: zeros are added at the front, and surplus integers are dropped from the front. To pad and/or truncate at the end of the sequence instead, we can use padding = "post" and/or truncating = "post" within the code. We can also see from the output of dim(trainx) that trainx is a 2,500 x 300 matrix.
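To make the difference between "pre" and "post" concrete, here is a minimal base R sketch that mimics the behavior described above on a single short vector. The helper pad_one is hypothetical and written only for illustration; it is not part of keras, and the real pad_sequences function works on a whole list of sequences at once.

```r
# Hypothetical helper (not part of keras): pads or truncates ONE integer
# vector to length maxlen, mirroring the "pre"/"post" behavior in the text.
pad_one <- function(x, maxlen, padding = "pre", truncating = "pre", value = 0L) {
  n <- length(x)
  if (n > maxlen) {
    # "pre" drops integers from the start; "post" drops them from the end
    x <- if (truncating == "pre") x[(n - maxlen + 1):n] else x[seq_len(maxlen)]
  } else if (n < maxlen) {
    # "pre" adds the fill values at the start; "post" adds them at the end
    fill <- rep(value, maxlen - n)
    x <- if (padding == "pre") c(fill, x) else c(x, fill)
  }
  x
}

long_seq  <- 1:7   # longer than maxlen, so it will be truncated
short_seq <- 1:3   # shorter than maxlen, so it will be padded

pre_trunc  <- pad_one(long_seq,  maxlen = 5)                        # 3 4 5 6 7
post_trunc <- pad_one(long_seq,  maxlen = 5, truncating = "post")   # 1 2 3 4 5
pre_pad    <- pad_one(short_seq, maxlen = 5)                        # 0 0 1 2 3
post_pad   <- pad_one(short_seq, maxlen = 5, padding = "post")      # 1 2 3 0 0
```

With the defaults, the last maxlen integers of a long sequence are kept, so the end of each article is what the model sees.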

Let's look at the output from text files 7 and 901 in the train data, as follows:

# Example of truncation
trainx[7,]
  [1]   5 157   1  18  87   3  90   3  59   1 169 346   2  29  52 425
 [17]   6  72 386 110 331  24   5   4   3  31   3  22   7  65  33 169
 [33] 329  10 105   1 239  11   4  31  11 422   8  60 163 318  10  58
 [49] 102   2 137 329 277  98  58 287  20  81   3 142   9   6  87   3
 [65]  49  20 142   2 142   6   2  60  13   1 470   8 137 190  60   1
 [81]  85 152   5   6 211   1   3   1  85  11   2 211 233  51 233 490
 [97]   7 155   3 305   6   4  86   3  70   4   3 157  52 142   6 282
[113] 233   4 286  11 485  47  11   9   1 386 497   2  72   7  33   6
[129]   3   1  60   3 234  23  32  72 485   7 203   6  29 390   5   3
[145]  19  13  55 184  53  10   1  41  19 485 119  18   6  59   1 169
[161]   1  41  10  17 458  91   6  23  12   1   3   3  10 491   2  14
[177]   1   1 194 469 491   2   1   4 331 112 485 475  16   1 469   1
[193] 331  14   2 485 234   5 171 296   1  85  11 135 157   2 189   1
[209]  31  24   4   5 318 490 338   6 147 194  24 347 386  23  24  32
[225] 117 286 161   6 338  25   4  32   2   9   1  38   8 316  60 153
[241]  27 234 496 457 153  20 316   2 254 219 145 117  25  46  27   7
[257] 228  34 184  75  11 418  52 296   1 194 469 180 469   6   1 268
[273]   6 250 469  29  90   6  15  58 175  32  33 229  37 424  36  51
[289]  36   3 169  15   1   7 175   1 319 207   5   4

# Example of padding
trainx[901,]
  [1]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [17]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [33]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [49]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [65]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [81]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 [97]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[113]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[129]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[145]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[161]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[177]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[193]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[209]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[225]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
[241]   0   0   0   0   0   0   0   0   0   0   0   0  74 356   7   9
[257] 199  12  11  61 145  31  22 399  79 145   1 133   3   1  28 203
[273]  29   1 319   3  18 101 470  31  29   2  20   5  33 369 116 134
[289]   7   2  25  17 303   2   5 222 100  28   6   5

From the preceding output, we can make the following observations:

Text file 7, which had 314 integers, has been reduced to 300 integers. Note that this step removed the first 14 integers of the sequence.
Text file 901, which had 48 integers, now has 300 integers. This has been achieved by adding 252 zeros at the beginning of the sequence to bring the total number of integers up to 300.
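These counts can be checked directly from a padded row. The following is a small sketch under two assumptions: that 0 appears only as the padding value and never as a word index, and that the row contains at least one non-zero entry. The vector padded_row is a made-up stand-in with the same shape as trainx[901,], not the actual data.

```r
# Synthetic stand-in for a pre-padded row such as trainx[901,]:
# 252 zeros followed by 48 non-zero word indices (the values are invented).
padded_row <- c(rep(0L, 252), rep(1L, 48))

# With "pre" padding, all zeros sit in front of the first non-zero entry,
# so the number of padding zeros is that entry's position minus one.
n_pad    <- which(padded_row != 0L)[1] - 1L
n_tokens <- length(padded_row) - n_pad

n_pad      # number of zeros added by padding
n_tokens   # number of integers in the original text
```

Replacing padded_row with trainx[901,] would apply the same check to the real data.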

Next, we will partition the training data into train and validation data, which will be required for training and assessing the network at the time of fitting the model.
