Now that we're familiar with the data, let's look at it in more detail:
- Let's import the word index for the IMDb data:
word_index <- dataset_imdb_word_index()
We can look at the head of the word index using the following code:
head(word_index)
Here, we can see that this is a list of key-value pairs, where the key is a word and the value is the integer it's mapped to:
Let's also look at the number of unique words in our word index:
length(word_index)
Here, we can see that there are 88,584 unique words in the word index:
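To make the key-value structure concrete, here is a minimal sketch using a toy named list in place of the real word index (the words and integer codes are hypothetical, not taken from the IMDb data):

```r
# Toy example mimicking the word-index structure (hypothetical values)
toy_index <- list(the = 1, movie = 17, great = 84)

# Look up the integer code for a word, just as we could with word_index
toy_index[["movie"]]  # 17
```

The real `word_index` supports the same `[[` lookup, only with 88,584 entries.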
- Now, we create a reversed list of key-value pairs of the word index. We will use this list to decode the reviews in the IMDb dataset:
reverse_word_index <- names(word_index)
names(reverse_word_index) <- word_index
head(reverse_word_index)
Here, we can see that the reversed word index list is a list of key-value pairs, where the key is the integer index and the value is the associated word:
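The reversal trick above works because assigning a list to `names()` coerces its values to character strings. A minimal sketch with a toy index (hypothetical words and codes) shows the mechanics:

```r
# Toy word index (hypothetical values)
toy_index <- list(good = 5, bad = 9)

# Swap keys and values, as done for reverse_word_index
rev_index <- names(toy_index)    # c("good", "bad")
names(rev_index) <- toy_index    # names coerced to "5", "9"

# Now we can decode an integer back to its word
rev_index[["9"]]  # "bad"
```

Note that the names of the reversed index are character strings, which is why the decoding step converts indices with `as.character()`.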
- Now, we decode the first review. Note that the word encodings are offset by 3 because the indices 0, 1, and 2 are reserved for padding, the start of a sequence, and out-of-vocabulary words, respectively:
decoded_review <- sapply(train_x[[1]], function(index) {
  word <- if (index >= 3) reverse_word_index[[as.character(index - 3)]]
  if (!is.null(word)) word else "?"
})
cat(decoded_review)
The following screenshot shows the decoded version of the first review:
- Let's pad all the sequences to make them uniform in length:
train_x <- pad_sequences(train_x, maxlen = 80)
test_x <- pad_sequences(test_x, maxlen = 80)
cat('x_train shape:', dim(train_x), '\n')
cat('x_test shape:', dim(test_x), '\n')
All the sequences are padded to a length of 80:
Now, let's look at the first review after padding it:
train_x[1,]
Here, you can see that the review only has 80 indexes after padding:
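With its default arguments (`padding = 'pre'`, `truncating = 'pre'`), `pad_sequences()` zero-pads short sequences at the front and keeps the last `maxlen` entries of long ones. The following base-R sketch (the helper name `pad_pre` is ours, not part of keras) approximates that behavior:

```r
# Rough base-R equivalent of pad_sequences with 'pre' padding/truncation
pad_pre <- function(seqs, maxlen) {
  t(sapply(seqs, function(s) {
    if (length(s) >= maxlen) {
      tail(s, maxlen)                    # 'pre' truncation keeps the end
    } else {
      c(rep(0L, maxlen - length(s)), s)  # 'pre' padding prepends zeros
    }
  }))
}

# A short and a long toy sequence, padded/truncated to length 5
pad_pre(list(c(5, 6, 7), 1:10), maxlen = 5)
```

This is why, in the padded first review, any leading zeros appear before the word indices rather than after them.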
- Now, we build the model for sentiment classification and view its summary:
model <- keras_model_sequential()
model %>%
  layer_embedding(input_dim = 1000, output_dim = 128) %>%
  layer_simple_rnn(units = 32) %>%
  layer_dense(units = 1, activation = 'sigmoid')
summary(model)
Here is the description of the model:
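Assuming keras's standard parameter counting, the totals in the summary can be reproduced by hand, which is a useful sanity check on the architecture:

```r
# Embedding: one 128-dim vector per vocabulary entry
embedding_params <- 1000 * 128            # 128,000

# Simple RNN: input weights + recurrent weights + biases
rnn_params <- 128 * 32 + 32 * 32 + 32     # 5,152

# Dense output layer: weights + bias
dense_params <- 32 * 1 + 1                # 33

embedding_params + rnn_params + dense_params  # 133,185
```

The recurrent term (`32 * 32`) is what distinguishes the RNN layer from a plain dense layer of the same size.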
- Now, we compile the model and train it:
# compile model
model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)

# train model
model %>% fit(
  train_x, train_y,
  batch_size = 32,
  epochs = 10,
  validation_split = .2
)
- Finally, we evaluate the model's performance on the test data and print the metrics:
scores <- model %>% evaluate(
  test_x, test_y,
  batch_size = 32
)
cat('Test score:', scores[[1]], '\n')
cat('Test accuracy:', scores[[2]], '\n')
The following screenshot shows the performance metrics on the test data:
By doing this, we achieved an accuracy of around 71% on the test data.
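The reported accuracy comes from thresholding the model's sigmoid outputs at 0.5 and comparing against the true labels. A minimal sketch with hypothetical probabilities and labels (not real model outputs) illustrates the computation:

```r
# Hypothetical predicted probabilities and true labels
probs  <- c(0.9, 0.2, 0.6, 0.4)
labels <- c(1, 0, 0, 1)

# Threshold at 0.5 to get class predictions
preds <- as.integer(probs > 0.5)

# Fraction of correct predictions
mean(preds == labels)  # 0.5
```

Two of the four toy predictions match their labels, giving an accuracy of 0.5; the evaluated model does the same computation over all 25,000 test reviews.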