Understanding the CBOW model

Let's say we have a neural network with an input layer, a hidden layer, and an output layer. The goal of the network is to predict a word given its surrounding words. The word that we are trying to predict is called the target word and the words surrounding the target word are called the context words.

How many context words do we use to predict the target word? We use a window of a particular size to choose the context words. If the window size is 2, then we use the two words before and the two words after the target word as the context words.

Let's consider the sentence, The Sun rises in the east, with the word rises as the target word. If we set the window size to 2, then we take the words the and sun, which are the two words before, and in and the, which are the two words after the target word rises, as the context words, as shown in the following figure:
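
To make this concrete, here is a minimal sketch in Python of picking out the context words for the target word rises, assuming a window size of 2; the variable names used here are purely illustrative:

    sentence = "the sun rises in the east".split()
    window_size = 2          # two words on each side of the target
    target_index = 2         # position of the target word "rises"

    target_word = sentence[target_index]
    context_words = (
        sentence[max(0, target_index - window_size):target_index]
        + sentence[target_index + 1:target_index + 1 + window_size]
    )

    print(target_word)       # rises
    print(context_words)     # ['the', 'sun', 'in', 'the']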

So, the input to the network is the context words and the output is the target word. How do we feed these inputs to the network? A neural network accepts only numeric input, so we cannot feed the raw context words directly to the network. Hence, we convert all the words in the given sentence into a numeric form using the one-hot encoding technique, as shown in the following figure:
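
As a small illustration, the following sketch one-hot encodes the words of the example sentence, assuming one particular (alphabetical) ordering of the five-word vocabulary:

    import numpy as np

    words = "the sun rises in the east".split()
    vocab = sorted(set(words))                      # ['east', 'in', 'rises', 'sun', 'the']
    word_to_index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        # vector of zeros with a single 1 at the word's index
        vector = np.zeros(len(vocab))
        vector[word_to_index[word]] = 1.0
        return vector

    print(one_hot("sun"))                           # [0. 0. 0. 1. 0.]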

The architecture of the CBOW model is shown in the following figure. As you can see, we feed the context words, the, sun, in, and the, as inputs to the network and it predicts the target word rises as an output:

In the initial iteration, the network cannot predict the target word correctly. But over a series of iterations, it learns to predict the correct target word using gradient descent. With gradient descent, we update the weights of the network and find the optimal weights with which we can predict the correct target word.

As we have one input, one hidden, and one output layer, as shown in the preceding figure, we will have two sets of weights:

  • Input layer to hidden layer weights
  • Hidden layer to output layer weights

During the training process, the network will try to find the optimal values for these two sets of weights so that it can predict the correct target word.
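
As a rough sketch of how these two sets of weights fit together, the following code runs a single CBOW forward pass with randomly initialized weights. The names W and W_prime, the vocabulary ordering, and the hidden layer size of 5 are assumptions made purely for illustration, not values taken from the figure:

    import numpy as np

    vocab_size, hidden_size = 5, 5                      # hidden size chosen to match the
                                                        # five-dimensional embeddings quoted later
    W = np.random.randn(vocab_size, hidden_size)        # input layer to hidden layer weights
    W_prime = np.random.randn(hidden_size, vocab_size)  # hidden layer to output layer weights

    # one-hot vectors of the context words the, sun, in, the
    # (vocabulary order assumed: east, in, rises, sun, the)
    context = np.zeros((4, vocab_size))
    context[0, 4] = 1.0   # the
    context[1, 3] = 1.0   # sun
    context[2, 1] = 1.0   # in
    context[3, 4] = 1.0   # the

    h = context.mean(axis=0) @ W          # hidden layer: average of the context word embeddings
    u = h @ W_prime                       # one score per word in the vocabulary
    y_hat = np.exp(u) / np.exp(u).sum()   # softmax: probability of each word being the target
    print(y_hat)                          # with untrained weights, the prediction is essentially random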

It turns out that the optimal weights between the input and hidden layers form the vector representations of the words. They basically constitute the semantic meaning of the words. So, after training, we simply remove the output layer, take the weights between the input and hidden layers, and assign them to the corresponding words.

After training, if we look at the weight matrix between the input and hidden layers, each of its rows gives the embedding of a word. So, the embedding of the word sun is [0.0, 0.3, 0.3, 0.6, 0.1].
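
To illustrate this lookup, the following sketch reads the embedding of the word sun off a stand-in for the trained input-to-hidden weight matrix; apart from the row for sun, which uses the values quoted above, the matrix entries are random placeholders, since the actual trained weights are not shown here:

    import numpy as np

    vocab = ['east', 'in', 'rises', 'sun', 'the']
    word_to_index = {w: i for i, w in enumerate(vocab)}

    # placeholder for the trained input-to-hidden weight matrix; only the row
    # for "sun" uses the values quoted above, the rest are random stand-ins
    W_trained = np.random.rand(len(vocab), 5)
    W_trained[word_to_index["sun"]] = [0.0, 0.3, 0.3, 0.6, 0.1]

    # assign each row of the matrix to its word as that word's embedding
    word_vectors = {word: W_trained[index] for word, index in word_to_index.items()}
    print(word_vectors["sun"])            # [0.  0.3 0.3 0.6 0.1]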

Thus, the CBOW model learns to predict the target word from the given context words. During training, it updates the weights of the network through gradient descent and finds the optimal weights with which it can predict the correct target word. Since the optimal weights between the input and hidden layers form the vector representations of the words, after training we simply take these weights and assign them as vectors to the corresponding words.

Now that we have an intuitive understanding of the CBOW model, we will go into detail and learn mathematically how exactly the word embeddings are computed.

We learned that the weights between the input and hidden layers form the vector representations of the words. But how exactly does the CBOW model predict the target word? How does it learn the optimal weights using backpropagation? Let's look at this in the next section.
