Preparing text data for model building

We will continue to use the IMDB movie review data that we used in the previous chapter on recurrent neural networks. This data is already available in a format suitable for developing deep network models with minimal data processing.

Let's take a look at the following code:

# IMDB data
library(keras)

# Keep only the 500 most frequent words in the vocabulary
imdb <- dataset_imdb(num_words = 500)

# Unpack the train and test splits into separate objects
c(c(train_x, train_y), c(test_x, test_y)) %<-% imdb

# Pad or truncate every review to a sequence of exactly 200 integers
train_x <- pad_sequences(train_x, maxlen = 200)
test_x <- pad_sequences(test_x, maxlen = 200)

The sequences of integers for the train and test data are stored in train_x and test_x, respectively. Similarly, train_y and test_y store labels indicating whether each movie review is positive or negative. We have restricted the vocabulary to the 500 most frequent words. For padding, we use 200 as the maximum length of each integer sequence for both the train and test data.
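As a quick check, we can look at the shape of the padded data and the first few labels. The dimensions shown in the comments below follow from the steps above: 25,000 reviews per split, each padded to 200 integers.

# Inspect the padded training data
dim(train_x)      # 25000 200: one row of 200 integers per review
head(train_y)     # labels: 0 = negative, 1 = positive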

When a sequence contains fewer than 200 integers, zeros are added at the beginning of the sequence to bring its length up to 200. However, when a sequence contains more than 200 integers, integers at the beginning are removed so that the total length is reduced to 200.
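To make this behavior concrete, here is a small illustration with made-up sequences (the integer values are arbitrary); pre-padding and pre-truncation are the defaults for pad_sequences:

# Pre-padding: zeros added at the beginning of a short sequence
pad_sequences(list(c(5, 9, 13)), maxlen = 5)
# [1,] 0 0 5 9 13

# Pre-truncation: integers at the beginning of a long sequence removed
pad_sequences(list(c(1, 2, 3, 4, 5, 6, 7)), maxlen = 5)
# [1,] 3 4 5 6 7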

As mentioned earlier, both the train and test datasets are balanced and each contains 25,000 movie reviews. For each movie review, a positive or negative label is also available.
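We can verify this balance directly by tabulating the labels:

# Each split contains 12,500 negative and 12,500 positive reviews
table(train_y)
table(test_y)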

Note that the choice of value for maxlen can impact model performance. If the value chosen is too small, many sequences will be truncated and words lost; if it is too large, many sequences will need padding with zeros. One way to balance truncation against padding is to choose a value close to the median sequence length.
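Since pad_sequences has already been applied above, sequence lengths must be computed on the raw, unpadded data. A minimal sketch of how we might inspect them, reloading the dataset into a separate object:

# Inspect review lengths on the raw (unpadded) sequences
imdb_raw <- dataset_imdb(num_words = 500)
review_lengths <- sapply(imdb_raw$train$x, length)
summary(review_lengths)
median(review_lengths)   # a candidate value for maxlen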