How it works...

In this example, we used the built-in IMDb reviews dataset from the keras library. We loaded the training and testing partitions of the data and had a look at the structure of these partitions. We saw that the data had been mapped to sequences of integer values, with each integer representing a particular word in a dictionary. This dictionary contains the words of the corpus, arranged by how frequently each word occurs. From this, we could see that the dictionary is a list of key-value pairs, with the keys representing the words and the values representing the indexes of those words in the dictionary. To discard the words that are not used frequently, we set a threshold of 1,000; that is, we kept only the 1,000 most frequent words in our training dataset and ignored the rest. Then, we moved on to the data processing part.
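
The recipe's own code isn't reproduced in this section, but the loading step can be sketched as follows in Python with tf.keras (an assumption; the recipe may use a different Keras binding). The num_words=1000 threshold matches the value discussed previously:

```python
# Minimal sketch, assuming the Python tf.keras API
from tensorflow.keras.datasets import imdb

# Keep only the 1,000 most frequent words; rarer words are dropped from the sequences
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=1000)

print(len(x_train), len(x_test))   # number of training and test reviews
print(x_train[0][:10])             # a review is a list of integer word indexes
print(y_train[0])                  # 1 = positive review, 0 = negative review
```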

In step 1, we imported the word index for the IMDb dataset. In this word index, the words in the data are encoded and indexed by their overall frequency in the dataset. In step 2, we created a reversed version of the word index, with its key-value pairs swapped, which we used to decode the sequences of encoded integers back into their original sentences. In step 3, we showcased how to regenerate a sample review.
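
Continuing from the previous sketch (and still assuming the Python tf.keras API), steps 1 to 3 might look like this; the decoded sample review is x_train[0] from the loading snippet:

```python
from tensorflow.keras.datasets import imdb

# Step 1: word -> integer index
word_index = imdb.get_word_index()

# Step 2: reverse the key-value pairs so we can map indexes back to words
reverse_word_index = {idx: word for word, idx in word_index.items()}

# Step 3: decode a sample review; indexes 0-3 are reserved tokens
# (padding, start, unknown, unused), so stored indexes are offset by 3
decoded_review = " ".join(
    reverse_word_index.get(i - 3, "?") for i in x_train[0]
)
print(decoded_review)
```
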
In step 4, we prepared the data so that it could be fed into the model. Since we cannot directly pass lists of integers of varying lengths into the model, we converted them into uniformly shaped tensors. To make the length of all the reviews uniform, we can follow either of these two approaches:

  • One-hot encoding: This will convert the sequences into tensors of the same length. The size of the matrix will be number of words * number of reviews. This approach is computationally heavy.
  • Pad the reviews: Alternatively, we can pad all the sequences so that they all have the same length. This will create an integer tensor of the shape num_examples * max_length. The max_length argument is used to cap the maximum number of words that we want to keep in all the reviews.

Since padding is less memory- and compute-intensive, we went with the second approach; that is, we padded the sequences to a maximum length of 80.
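
A minimal sketch of this padding step, again assuming the Python tf.keras API and the arrays from the earlier snippets:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 80

# Truncate longer reviews and pad shorter ones so every sequence has length 80
x_train = pad_sequences(x_train, maxlen=max_length)
x_test = pad_sequences(x_test, maxlen=max_length)

print(x_train.shape)   # (num_examples, 80)
```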

In step 5, we defined a sequential Keras model and configured its layers. The first layer is the embedding layer, which captures the context of the word sequences in our data and provides information about the relevant features. In an embedding, words are represented as dense vectors. Each vector is the projection of a word into a continuous vector space that is learned from the text, based on the words that surround a particular word. The position of a word in this vector space is referred to as its embedding. When we embed, we represent each word in terms of some latent factors. For example, the word brilliant can be represented by a vector; let's say, [.32, .02, .48, .21, .56, .15]. This is computationally efficient when we're using massive datasets since it reduces dimensionality. The embedding vectors are also updated during the training of the deep neural network, which helps in identifying similar words in a multi-dimensional space. Word embeddings also reflect how words are related to each other semantically. For example, words such as talking and talked can be thought of as related in the same way as swimming is related to swam.

The following diagram shows a pictorial representation of word embedding:

The embedding layer is defined by specifying three arguments:

  • input_dim: This is the size of the vocabulary in the text data. In our example, the text data has been integer-encoded to values between 0 and 999, so the size of the vocabulary is 1,000 words.
  • output_dim: This is the size of the vector space in which words will be embedded. We specified it as 128.
  • input_length: This is the length of the input sequences, as we define it for any input layer of a Keras model. In our case, this is 80, the padded review length.

In the next layer, we defined a simple RNN with 32 hidden units. If n is the number of input dimensions and d is the number of hidden units in the RNN layer, then the number of trainable parameters can be given by the following equation:

number of parameters = (n × d) + (d × d) + d = d × (n + d + 1)
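
The model definition for step 5 might be sketched as follows in Python with tf.keras (an assumption; the layer sizes match the text, and the snippet assumes a tf.keras 2.x release where Embedding still accepts the input_length argument, which newer Keras versions have removed):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

model = Sequential([
    # input_dim=1000 (vocabulary size), output_dim=128 (embedding size),
    # input_length=80 (padded review length)
    Embedding(input_dim=1000, output_dim=128, input_length=80),
    # Simple RNN layer with 32 hidden units:
    # trainable parameters = d * (n + d + 1) = 32 * (128 + 32 + 1) = 5,152
    SimpleRNN(32),
    # Single sigmoid output node for binary sentiment classification
    Dense(1, activation="sigmoid"),
])

model.summary()
```

With n = 128 and d = 32, the equation gives 32 × 161 = 5,152 trainable parameters for the RNN layer, which model.summary() should confirm.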

The last layer is densely connected with a single output node. Here, we used the sigmoid activation function since this is a binary classification task. In step 6, we compiled the model. We specified binary_crossentropy as the loss function, since we were dealing with binary classification, and adam as the optimizer. Then, we trained our model with a validation split of 20%. Finally, in the last step, we evaluated the test accuracy of our model to see how it performed on the test data.
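
A sketch of the compile, train, and evaluate steps, under the same tf.keras assumption (the epoch count and batch size here are illustrative guesses, not values taken from the recipe):

```python
model.compile(
    loss="binary_crossentropy",   # binary classification loss
    optimizer="adam",
    metrics=["accuracy"],
)

# Hold out 20% of the training data for validation
history = model.fit(
    x_train, y_train,
    epochs=10,            # assumption; the recipe may use a different value
    batch_size=32,        # assumption; the recipe may use a different value
    validation_split=0.2,
)

# Evaluate performance on the test partition
test_loss, test_acc = model.evaluate(x_test, y_test)
print("Test accuracy:", test_acc)
```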
