Preparing data for model building

In this chapter, we'll be using the Internet Movie Database (IMDb) movie review text data that's available through the Keras package. There is no need to obtain this data separately, as the package fetches it for us with code that we will discuss shortly. In addition, this dataset is already preprocessed so that the text of each review is converted into a sequence of integers. Text cannot be used directly for model building, so this conversion is necessary before the data can serve as input to a deep learning network.
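To make the idea of converting text into integers concrete, here is a minimal base R sketch on a hypothetical three-review corpus. It is not the preprocessing that produced the IMDb data; the names reviews, tokens, word_index, and encoded are illustrative only. It simply ranks words by frequency and replaces each word with its rank:

```r
# Toy corpus standing in for movie reviews (hypothetical data)
reviews <- c("the movie was great",
             "the movie was bad",
             "great acting and a great story")

# Split each review into lowercase words
tokens <- strsplit(tolower(reviews), " ", fixed = TRUE)

# Rank words by overall frequency: rank 1 is the most frequent word
freq <- sort(table(unlist(tokens)), decreasing = TRUE)
word_index <- setNames(seq_along(freq), names(freq))

# Replace each word with its integer rank
encoded <- lapply(tokens, function(w) unname(word_index[w]))
encoded[[1]]
```

Here, "great" occurs three times and therefore receives rank 1, while each review becomes an integer vector with one entry per word.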

We will start by loading the imdb data using the dataset_imdb function, where we will also restrict the vocabulary to the most frequent words by setting num_words to 500. Then, we'll split the imdb data into train and test datasets. Let's take a look at the following code to understand this data:

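# Load the keras package, which provides dataset_imdb() and %<-%
library(keras)

# IMDB data, with the vocabulary restricted by num_words
imdb <- dataset_imdb(num_words = 500)
# Unpack the nested list into train and test inputs and labels
c(c(train_x, train_y), c(test_x, test_y)) %<-% imdb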
length(train_x); length(test_x)
[1] 25000
[1] 25000

table(train_y)
train_y
    0     1 
12500 12500 

table(test_y)
test_y
    0     1 
12500 12500

Let's take a look at the preceding code:

train_x and test_x contain integers representing reviews in the train and test data, respectively.
Similarly, train_y and test_y contain 0 and 1 labels, representing negative and positive sentiments, respectively.
Using the length function, we can see that train_x and test_x each contain 25,000 movie reviews.
The tables for train_y and test_y show that there is an equal number of positive (12,500) and negative (12,500) reviews in the train and test data.

Having such a balanced dataset is useful in avoiding any bias due to class imbalance issues.
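One way to quantify this balance is to look at class proportions rather than raw counts. The following sketch uses a hypothetical labels vector with the same class counts as train_y, so that it runs without loading the data. The names labels, class_share, and baseline_accuracy are illustrative:

```r
# Hypothetical label vector with the same class counts as train_y
labels <- rep(c(0L, 1L), each = 12500)

# Share of each class; equal shares mean the classes are balanced
class_share <- prop.table(table(labels))
class_share

# Accuracy obtained by always predicting the most common class
baseline_accuracy <- max(class_share)
baseline_accuracy
```

With balanced classes, always predicting a single label is correct only half the time, so any accuracy well above 0.5 reflects something the model has actually learned.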

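The words in the movie reviews are represented by integers, and the integer assigned to each word is based on its overall frequency in the dataset: the more frequent a word, the smaller its integer. With the default arguments of dataset_imdb, however, the smallest integers are reserved. Integer 1 marks the start of every review, and integer 2 stands for an unknown word, that is, any word outside the vocabulary kept by num_words. Word ranks are shifted by 3 to make room for these markers, so integer 4 represents the most frequent word, integer 5 represents the second most frequent word, and so on. Integer 0 is not used for any specific word and is typically reserved for padding.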
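To see how such integers map back to words, the following is a minimal sketch with a hypothetical five-word index. It assumes the dataset_imdb defaults (start_char = 1, oov_char = 2, index_from = 3). The names word_index, special, and decode_review are illustrative, and in practice the real word-to-rank mapping would come from dataset_imdb_word_index():

```r
# Hypothetical miniature word index: word -> frequency rank (1 = most frequent)
word_index <- c(the = 1L, and = 2L, a = 3L, movie = 4L, great = 5L)

# Assumed dataset_imdb() defaults: 0 = padding, 1 = start, 2 = unknown,
# and every word rank is shifted up by index_from = 3
index_from <- 3L
special <- c("<pad>", "<start>", "<unk>")

decode_review <- function(ids) {
  # Look up the word whose shifted rank matches each integer
  words <- names(word_index)[match(ids - index_from, word_index)]
  # Integers 0, 1, and 2 are reserved markers rather than words
  reserved <- ids <= 2L
  words[reserved] <- special[ids[reserved] + 1L]
  words
}

decode_review(c(1L, 4L, 7L, 2L, 8L))
# "<start>" "the" "movie" "<unk>" "great"
```

Under these assumptions, the leading 1 in every sequence decodes to the start marker, and each 2 decodes to an unknown-word placeholder.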

Let's take a look at the third and sixth sequences in the train_x data, along with the lengths of the first six sequences, using the following code:

# Sequence of integers
train_x[[3]]
  [1]   1  14  47   8  30  31   7   4 249 108   7   4   2  54  61 369
 [17]  13  71 149  14  22 112   4   2 311  12  16   2  33  75  43   2
 [33] 296   4  86 320  35   2  19 263   2   2   4   2  33  89  78  12
 [49]  66  16   4 360   7   4  58 316 334  11   4   2  43   2   2   8
 [65] 257  85   2  42   2   2  83  68   2  15  36 165   2 278  36  69
 [81]   2   2   8 106  14   2   2  18   6  22  12 215  28   2  40   6
 [97]  87 326  23   2  21  23  22  12 272  40  57  31  11   4  22  47
[113]   6   2  51   9 170  23   2 116   2   2  13 191  79   2  89   2
[129]  14   9   8 106   2   2  35   2   6 227   7 129 113

train_x[[6]]
 [1]   1   2 128  74  12   2 163  15   4   2   2   2   2  32  85 156  45
[18]  40 148 139 121   2   2  10  10   2 173   4   2   2  16   2   8   4
[35] 226  65  12  43 127  24   2  10  10

for (i in 1:6) print(length(train_x[[i]]))

Output

[1] 218
[1] 189
[1] 141
[1] 550
[1] 147
[1] 43

From the preceding code and output, we can observe the following:

The third review contains 141 integers, ranging from 1 (the 1st integer) to 369 (the 16th integer).
Since we restricted the vocabulary with num_words = 500, no integer in the third review is 500 or larger.
Similarly, the sixth review contains 43 integers, ranging from 1 (the 1st integer) to 226 (the 35th integer).
Looking at the lengths of the first six sequences in the train_x data, we can observe that review length varies between 43 (the 6th review in the train data) and 550 (the 4th review in the train data). Such variation in the length of movie reviews is normal and expected.
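The loop used earlier prints one length at a time. As a sketch of a more compact alternative, base R's lengths function returns the length of every list element in one call. The toy_x list below is a hypothetical stand-in for the first six entries of train_x, built to have the same lengths as the printed output:

```r
# Hypothetical stand-in for the first six entries of train_x,
# using the sequence lengths printed above
lens <- c(218, 189, 141, 550, 147, 43)
toy_x <- lapply(lens, function(n) rep(2L, n))

# lengths() returns the length of every list element in one call
review_lengths <- lengths(toy_x)
review_lengths

range(review_lengths)      # shortest and longest review
which.min(review_lengths)  # position of the shortest review
which.max(review_lengths)  # position of the longest review
```

Applied to the full train_x list, the same calls would summarize all 25,000 reviews, which helps when choosing a common sequence length.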

Before we can develop a movie review sentiment classification model, we need to find a way to make the length of a sequence of integers the same for all the movie reviews. We can achieve this by padding sequences.
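As a rough illustration of what padding involves, here is a base R sketch. It is not the Keras implementation; the helper pad_one and the toy_x data are hypothetical. It adds zeros at the front of short sequences and keeps only the last maxlen entries of long ones, which is assumed to mirror the defaults of the pad_sequences function in the keras package:

```r
# Pad or truncate one integer sequence to exactly maxlen entries.
# Zeros are added at the front, and overly long sequences keep their last
# maxlen entries (assumed to mirror the pad_sequences() defaults).
pad_one <- function(x, maxlen, value = 0L) {
  n <- length(x)
  if (n >= maxlen) {
    x[(n - maxlen + 1):n]
  } else {
    c(rep(value, maxlen - n), x)
  }
}

# Hypothetical short sequences of different lengths
toy_x <- list(c(1L, 14L, 47L), c(1L, 2L, 128L, 74L, 12L, 2L, 163L))

# Stack the padded sequences into a matrix with one row per review
padded <- t(sapply(toy_x, pad_one, maxlen = 5L))
padded
```

Every row of padded now has exactly five entries: the first sequence gains two leading zeros, and the second loses its first two integers.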
