CBOW with multiple context words

Now that we understand how the CBOW model works with a single context word, we will see how it works when we have multiple context words. The architecture of CBOW with multiple input context words is shown in the following figure:

There is not much difference between using multiple context words and a single context word. The difference is that, with multiple context words as inputs, we take the average of all the input context words. That is, as a first step, we forward propagate the network and compute the value of $h$ by multiplying the input $x$ by the weights $W$, as we saw in the CBOW with a single context word section:

$$h = W^T x$$

But here, since we have multiple context words, we will have multiple inputs (that is, $x_1, x_2, \dots, x_C$), where $C$ is the number of context words, and we simply take the average of them and multiply by the weight matrix, shown as follows:

$$h = W^T \left( \frac{x_1 + x_2 + \dots + x_C}{C} \right)$$

Similar to what we learned in the CBOW with a single context word section, $W^T x_1$ represents the vector representation of the input context word $w_1$, $W^T x_2$ represents the vector representation of the input context word $w_2$, and so on.

We denote the representation of the input context word $w_1$ by $h_1$, the representation of the input context word $w_2$ by $h_2$, and so on. So, we can rewrite the preceding equation as:

$$h = \frac{h_1 + h_2 + \dots + h_C}{C} \tag{6}$$

Here, $C$ represents the number of context words.
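
To make the averaging step concrete, here is a minimal NumPy sketch; the vocabulary size, embedding size, and context-word indices are illustrative assumptions, not values from the text. Because each input $x_c$ is one-hot, multiplying by $W^T$ just selects a row of $W$, so the hidden layer is simply the mean of the context words' rows:

```python
import numpy as np

# Illustrative sizes: V = vocabulary size, N = embedding dimension.
V, N = 10, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))  # input-to-hidden weight matrix

# Indices of the C context words; the one-hot inputs x_1..x_C
# each pick out a single row of W.
context_ids = [2, 5, 7]
C = len(context_ids)

# h = W^T (x_1 + ... + x_C) / C  ==  the mean of the context rows of W.
h = W[context_ids].mean(axis=0)
```

Each selected row `W[c]` plays the role of one context word's representation in the average.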

Computing the value of $u_j$ is the same as we saw in the previous section:

$$u_j = W'^{T}_{(j)} h \tag{7}$$

Here, $W'_{(j)}$ denotes the vector representation of the $j$th word in the vocabulary.

Substituting equation (6) in equation (7), we can write the following:

$$u_j = W'^{T}_{(j)} \left( \frac{h_1 + h_2 + \dots + h_C}{C} \right)$$

The preceding equation gives us the similarity between the $j$th word in the vocabulary and the average representation of the given input context words.
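
As a quick sketch of this scoring step, again with made-up sizes and values, computing every $u_j$ at once is a single matrix-vector product, and each entry is the dot product (the similarity) between a vocabulary word's output vector and the averaged context representation:

```python
import numpy as np

V, N = 10, 4
rng = np.random.default_rng(1)
W_prime = rng.normal(size=(N, V))  # hidden-to-output weight matrix W'
h = rng.normal(size=N)             # averaged context representation

# u_j = W'_(j)^T h for every j; column j of W' is word j's output vector.
u = W_prime.T @ h
```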

The loss function is the same as we saw in the single context word case, and it is given as:

$$L = -u_{j^*} + \log \sum_{j'=1}^{V} \exp(u_{j'})$$

Here, $j^*$ is the index of the actual target word and $V$ is the vocabulary size.
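
This loss is just the negative log of the softmax probability of the target word. A small sketch with dummy scores (the target index is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.normal(size=10)  # scores u_1..u_V from the forward pass
j_star = 6               # index of the actual target word (illustrative)

# L = -u_{j*} + log(sum_j' exp(u_j'))
loss = -u[j_star] + np.log(np.sum(np.exp(u)))

# Equivalently, the negative log of the target's softmax probability.
p = np.exp(u) / np.sum(np.exp(u))
```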

Now, there is a small difference in backpropagation. We know that in backpropagation we compute gradients and update our weights according to the weight update rule. Recall that, in the previous section, this is how we updated the weights:

$$W' = W' - \alpha \frac{\partial L}{\partial W'}$$

$$W = W - \alpha \frac{\partial L}{\partial W}$$

Since here we have multiple context words as input, we take the average of the context words while computing $\frac{\partial L}{\partial W}$:

$$\frac{\partial L}{\partial W} = \left( \frac{x_1 + x_2 + \dots + x_C}{C} \right) \left( \frac{\partial L}{\partial h} \right)^T$$

Here, $\frac{\partial L}{\partial h}$ is the gradient of the loss with respect to the hidden layer, computed exactly as in the single context word case.

Computing $\frac{\partial L}{\partial W'}$ is the same as we saw in the previous section:

$$\frac{\partial L}{\partial W'} = h\,(\hat{y} - y)^T$$

Here, $\hat{y}$ is the softmax output of the network and $y$ is the one-hot vector of the target word.
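
The whole backward pass can be sketched as follows. This is a minimal NumPy illustration under the usual softmax formulation; the sizes, indices, learning rate, and the error term `e = y_hat - y` are assumptions spelled out in the comments, not values from the text. The only multi-word twist is that each of the $C$ context rows of $W$ receives $1/C$ of the hidden-layer gradient:

```python
import numpy as np

V, N, alpha = 10, 4, 0.1   # illustrative sizes and learning rate
rng = np.random.default_rng(3)
W = rng.normal(size=(V, N))        # input-to-hidden weights
W_prime = rng.normal(size=(N, V))  # hidden-to-output weights W'

context_ids = [2, 5, 7]  # the C context words (illustrative)
C = len(context_ids)
j_star = 6               # target word index (illustrative)

# Forward pass, as derived above.
h = W[context_ids].mean(axis=0)
u = W_prime.T @ h
y_hat = np.exp(u) / np.sum(np.exp(u))

# Prediction error e = y_hat - y, with y the one-hot target vector.
e = y_hat.copy()
e[j_star] -= 1.0

# Gradients (computed before either update is applied).
dW_prime = np.outer(h, e)  # dL/dW' = h e^T
grad_h = W_prime @ e       # dL/dh

# Weight updates: each context row of W gets 1/C of grad_h.
W_prime -= alpha * dW_prime
for c in context_ids:
    W[c] -= alpha * grad_h / C
```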

So, in a nutshell, with a multi-word context, we just take the average of the multiple input context words and build the model exactly as we did for the single context word CBOW.
