Preparing data for model building

In this chapter, we'll use the Internet Movie Database (IMDb) movie review text data that's available through the Keras package. There is no need to obtain this data separately, because it can be loaded directly through the Keras library using code that we will discuss shortly. The dataset is also preprocessed: each review has already been converted into a sequence of integers. Text cannot be fed to a model directly, so this conversion is required before the data can be used as input to a deep learning network.

We will start by loading the imdb data using the dataset_imdb function, where we use num_words to restrict the vocabulary to the 500 most frequent words. The function returns the data already divided into train and test portions, so we then unpack the imdb object into train and test datasets. Let's take a look at the following code to understand this data:

# IMDB data
library(keras)  # provides dataset_imdb() and the %<-% operator
imdb <- dataset_imdb(num_words = 500)
c(c(train_x, train_y), c(test_x, test_y)) %<-% imdb
length(train_x); length(test_x)
[1] 25000
[1] 25000

table(train_y)
train_y
    0     1
12500 12500

table(test_y)
test_y
    0     1
12500 12500

Let's take a look at the preceding code:

  • train_x and test_x are lists of integer sequences, one sequence per review, for the train and test data, respectively.
  • Similarly, train_y and test_y contain 0 and 1 labels, representing negative and positive sentiments, respectively.
  • The length function shows that train_x and test_x each contain 25,000 movie reviews.
  • The tables for train_y and test_y show that the train and test data each contain an equal number of positive (12,500) and negative (12,500) reviews.

A balanced dataset like this helps avoid the bias toward the majority class that class imbalance can introduce.
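
One quick way to quantify the balance is to convert the label counts into proportions. The sketch below uses a stand-in vector, labels, built from the counts reported above rather than the real train_y, so that it runs without Keras. With the actual data, train_y could be passed to table in the same way.

```r
# Stand-in for train_y: 12,500 negative (0) and 12,500 positive (1) labels,
# matching the counts reported by table(train_y). Not the real data.
labels <- rep(c(0L, 1L), each = 12500)

# Share of each class among all labels
balance <- prop.table(table(labels))
print(balance)

# Ratio of the largest to the smallest class share; 1 means perfectly balanced
imbalance_ratio <- max(balance) / min(balance)
print(imbalance_ratio)
```

A ratio well above 1 would be a signal to consider class weights or resampling before model building.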

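Each word in a movie review is represented by a unique integer that is based on the word's overall frequency in the dataset: the more frequent the word, the smaller its integer. A few small integers are reserved, however. With the default settings of dataset_imdb (start_char = 1, oov_char = 2, and index_from = 3), integer 0 is kept for padding, integer 1 marks the start of each review, and integer 2 stands for an unknown (out-of-vocabulary) word, which here means any word that falls outside the num_words limit. Actual words therefore start at integer 4, which represents the most frequent word, followed by integer 5 for the second most frequent word, and so on.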
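
To make the encoding concrete, here is a minimal base R sketch of a frequency-based scheme of this kind. It is a toy illustration, not the Keras implementation: the function encode_review and the six-word vocab are hypothetical, and the reserved integers assume the dataset_imdb defaults of start_char = 1, oov_char = 2, and index_from = 3.

```r
# Toy encoder (hypothetical helper, not part of Keras).
# vocab must be sorted from most to least frequent word.
encode_review <- function(words, vocab, num_words,
                          start_char = 1L, oov_char = 2L, index_from = 3L) {
  rank <- match(words, vocab)   # frequency rank of each word, NA if not in vocab
  idx <- rank + index_from      # shift ranks past the reserved integers
  # Words that are unknown or outside the num_words limit become oov_char
  idx[is.na(idx) | idx >= num_words] <- oov_char
  c(start_char, as.integer(idx))  # every sequence begins with the start marker
}

# Hypothetical vocabulary, most frequent word first
vocab <- c("the", "and", "a", "of", "movie", "great")

encoded <- encode_review(c("the", "movie", "of", "dreams"), vocab, num_words = 8)
print(encoded)
# [1] 1 4 2 7 2
```

Here "the" (rank 1) becomes 4 and "of" (rank 4) becomes 7, while "movie" (rank 5, which would be 8) is cut off by num_words = 8 and "dreams" is not in the vocabulary at all, so both are replaced by 2.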

Let's take a look at the third and sixth sequences in the train_x data using the following code:

# Sequence of integers
train_x[[3]]
[1] 1 14 47 8 30 31 7 4 249 108 7 4 2 54 61 369
[17] 13 71 149 14 22 112 4 2 311 12 16 2 33 75 43 2
[33] 296 4 86 320 35 2 19 263 2 2 4 2 33 89 78 12
[49] 66 16 4 360 7 4 58 316 334 11 4 2 43 2 2 8
[65] 257 85 2 42 2 2 83 68 2 15 36 165 2 278 36 69
[81] 2 2 8 106 14 2 2 18 6 22 12 215 28 2 40 6
[97] 87 326 23 2 21 23 22 12 272 40 57 31 11 4 22 47
[113] 6 2 51 9 170 23 2 116 2 2 13 191 79 2 89 2
[129] 14 9 8 106 2 2 35 2 6 227 7 129 113

train_x[[6]]
[1] 1 2 128 74 12 2 163 15 4 2 2 2 2 32 85 156 45
[18] 40 148 139 121 2 2 10 10 2 173 4 2 2 16 2 8 4
[35] 226 65 12 43 127 24 2 10 10

# Length of each of the first six sequences
for (i in 1:6) print(length(train_x[[i]]))

Output

[1] 218
[1] 189
[1] 141
[1] 550
[1] 147
[1] 43

From the preceding code and output, we can observe the following:

  • The third review is a sequence of 141 integers whose values range from 1 (the 1st integer) to 369 (the 16th integer).
  • Since we restricted the data to the 500 most frequent words using num_words, no integer in the third review is 500 or larger.
  • Similarly, the sixth review is a sequence of 43 integers whose values range from 1 (the 1st integer) to 226 (the 35th integer).
  • Looking at the lengths of the first six sequences in train_x, we can see that review length varies from 43 (the 6th review in the train data) to 550 (the 4th review in the train data). Such variation in review length is expected.
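
The observations above can also be checked programmatically rather than by reading the printed output. The sketch below copies the sixth review from the output shown earlier into a vector named review_6 (a name chosen here for illustration), so that it runs without loading the data. Counting the integer 2 as an out-of-vocabulary marker assumes the default oov_char of dataset_imdb.

```r
# Sixth training review, copied from the printed output of train_x[[6]]
review_6 <- c(1, 2, 128, 74, 12, 2, 163, 15, 4, 2, 2, 2, 2, 32, 85, 156, 45,
              40, 148, 139, 121, 2, 2, 10, 10, 2, 173, 4, 2, 2, 16, 2, 8, 4,
              226, 65, 12, 43, 127, 24, 2, 10, 10)

n_tokens <- length(review_6)        # number of integers in the review
value_range <- range(review_6)      # smallest and largest integer
max_position <- which.max(review_6) # position of the largest integer

# Number of positions holding 2, assumed here to mark out-of-vocabulary words
n_oov <- sum(review_6 == 2)

print(n_tokens)
print(value_range)
print(max_position)
print(n_oov)
```

With the full data loaded, summary(lengths(train_x)) gives the same kind of overview for all 25,000 review lengths at once.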

Before we can develop a movie review sentiment classification model, we need to make the integer sequences the same length for all the movie reviews. We can achieve this by padding sequences.
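
As a rough illustration of the idea, the base R sketch below brings every sequence to a common length. The helper pad_to_length is hypothetical and written only for this example. It adds zeros at the front of short sequences and keeps only the last maxlen integers of long ones, which is how the Keras pad_sequences function behaves by default as far as we are aware; treat that correspondence as an assumption.

```r
# Hypothetical helper: pad or truncate each sequence to exactly maxlen integers
pad_to_length <- function(seqs, maxlen, value = 0L) {
  rows <- lapply(seqs, function(s) {
    s <- as.integer(s)
    n <- length(s)
    if (n >= maxlen) {
      s[(n - maxlen + 1):n]            # too long: keep the last maxlen integers
    } else {
      c(rep(value, maxlen - n), s)     # too short: add padding at the front
    }
  })
  do.call(rbind, rows)                 # one row per sequence
}

# Two toy sequences of different lengths
toy <- list(c(1L, 14L, 47L),
            c(1L, 2L, 128L, 74L, 12L, 2L))

padded <- pad_to_length(toy, maxlen = 5)
print(padded)
```

The result is a matrix with one row per review and maxlen columns, which is the rectangular shape a network expects as input. The choice of maxlen is a trade-off: a small value discards more of the long reviews, while a large value fills short reviews mostly with zeros.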
