Now that we're familiar with the data, let's look at it in more detail:
- Let's import the word index for the IMDb data:
word_index <- dataset_imdb_word_index()
We can look at the head of the word index using the following code:
head(word_index)
Here, we can see that this is a list of key-value pairs, where the key is a word and the value is the integer it's mapped to:
Let's also look at the number of unique words in our word index:
length(word_index)
Here, we can see that there are 88,584 unique words in the word index:
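To make the key-value structure concrete, here is a minimal sketch using a toy named list in place of the real word index (the words and integer codes are hypothetical, not taken from the IMDb data):

```r
# Toy example mimicking the word-index structure (hypothetical values)
toy_index <- list(the = 1, movie = 17, great = 84)

# Look up the integer code for a word, just as we could with word_index
toy_index[["movie"]]  # 17
```

The real `word_index` supports the same `[[` lookup, only with 88,584 entries.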
- Now, we create a reversed list of key-value pairs of the word index. We will use this list to decode the reviews in the IMDb dataset:
reverse_word_index <- names(word_index)
names(reverse_word_index) <- word_index
head(reverse_word_index)
Here, we can see that the reversed word index list is a list of key-value pairs, where the key is the integer index and the value is the associated word:
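The reversal trick above works because assigning a list to `names()` coerces its values to character strings. A minimal sketch with a toy index (hypothetical words and codes) shows the mechanics:

```r
# Toy word index (hypothetical values)
toy_index <- list(good = 5, bad = 9)

# Swap keys and values, as done for reverse_word_index
rev_index <- names(toy_index)    # c("good", "bad")
names(rev_index) <- toy_index    # names coerced to "5", "9"

# Now we can decode an integer back to its word
rev_index[["9"]]  # "bad"
```

Note that the names of the reversed index are character strings, which is why the decoding step converts indices with `as.character()`.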
- Now, we decode the first review. Note that the word encodings are offset by 3 because the indices 0, 1, and 2 are reserved for padding, the start of a sequence, and out-of-vocabulary words, respectively:
decoded_review <- sapply(train_x[[1]], function(index) {
  word <- if (index >= 3) reverse_word_index[[as.character(index - 3)]]
  if (!is.null(word)) word else "?"
})
cat(decoded_review)
The following screenshot shows the decoded version of the first review:
- Let's pad all the sequences to make them uniform in length:
train_x <- pad_sequences(train_x, maxlen = 80)
test_x <- pad_sequences(test_x, maxlen = 80)
cat('x_train shape:', dim(train_x), '\n')
cat('x_test shape:', dim(test_x), '\n')
All the sequences are padded to a length of 80:
Now, let's look at the first review after padding it:
train_x[1,]
Here, you can see that the review only has 80 indexes after padding:
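With its default arguments (`padding = 'pre'`, `truncating = 'pre'`), `pad_sequences()` zero-pads short sequences at the front and keeps the last `maxlen` entries of long ones. The following base-R sketch (the helper name `pad_pre` is ours, not part of keras) approximates that behavior:

```r
# Rough base-R equivalent of pad_sequences with 'pre' padding/truncation
pad_pre <- function(seqs, maxlen) {
  t(sapply(seqs, function(s) {
    if (length(s) >= maxlen) {
      tail(s, maxlen)                    # 'pre' truncation keeps the end
    } else {
      c(rep(0L, maxlen - length(s)), s)  # 'pre' padding prepends zeros
    }
  }))
}

# A short and a long toy sequence, padded/truncated to length 5
pad_pre(list(c(5, 6, 7), 1:10), maxlen = 5)
```

This is why, in the padded first review, any leading zeros appear before the word indices rather than after them.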
- Now, we build the model for sentiment classification and view its summary:
model <- keras_model_sequential()
model %>%
  layer_embedding(input_dim = 1000, output_dim = 128) %>%
  layer_simple_rnn(units = 32) %>%
  layer_dense(units = 1, activation = 'sigmoid')
summary(model)
Here is the description of the model:
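Assuming keras's standard parameter counting, the totals in the summary can be reproduced by hand, which is a useful sanity check on the architecture:

```r
# Embedding: one 128-dim vector per vocabulary entry
embedding_params <- 1000 * 128            # 128,000

# Simple RNN: input weights + recurrent weights + biases
rnn_params <- 128 * 32 + 32 * 32 + 32     # 5,152

# Dense output layer: weights + bias
dense_params <- 32 * 1 + 1                # 33

embedding_params + rnn_params + dense_params  # 133,185
```

The recurrent term (`32 * 32`) is what distinguishes the RNN layer from a plain dense layer of the same size.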
- Now, we compile the model and train it:
# compile model
model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)

# train model
model %>% fit(
  train_x, train_y,
  batch_size = 32,
  epochs = 10,
  validation_split = .2
)
- Finally, we evaluate the model's performance on the test data and print the metrics:
scores <- model %>% evaluate(
  test_x, test_y,
  batch_size = 32
)
cat('Test score:', scores[[1]], '\n')
cat('Test accuracy:', scores[[2]], '\n')
The following screenshot shows the performance metrics on the test data:
By doing this, we achieved an accuracy of around 71% on the test data.
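The reported accuracy comes from thresholding the model's sigmoid outputs at 0.5 and comparing against the true labels. A minimal sketch with hypothetical probabilities and labels (not real model outputs) illustrates the computation:

```r
# Hypothetical predicted probabilities and true labels
probs  <- c(0.9, 0.2, 0.6, 0.4)
labels <- c(1, 0, 0, 1)

# Threshold at 0.5 to get class predictions
preds <- as.integer(probs > 0.5)

# Fraction of correct predictions
mean(preds == labels)  # 0.5
```

Two of the four toy predictions match their labels, giving an accuracy of 0.5; the evaluated model does the same computation over all 25,000 test reviews.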