Forward propagation in skip-gram

First, we will understand how forward propagation works in the skip-gram model. Let's use the same notation we used in the CBOW model. The architecture of the skip-gram model is shown in the following figure. As you can see, we feed only one target word as an input and it returns the context words as an output:

Similar to what we saw in the Forward propagation section of the CBOW model, first we multiply our input with the input-to-hidden layer weights:

$$h = W^T x$$

We can directly rewrite the preceding equation as:

$$h = v_{w_i}$$

Here, $v_{w_i}$ implies the vector representation of the input word $w_i$. Since $x$ is a one-hot vector, multiplying it by $W^T$ simply selects the row of $W$ that corresponds to the input word.
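To make this concrete, here is a minimal NumPy sketch of the hidden-layer computation. The vocabulary size $V$, the embedding dimension $N$, and the word index $i$ are made-up values used only for illustration:

```python
import numpy as np

# Hypothetical sizes: V = vocabulary size, N = embedding (hidden layer) dimension
V, N = 10, 4
rng = np.random.default_rng(0)

W = rng.standard_normal((V, N))   # input-to-hidden weights, one row per word

i = 3                             # index of the input (target) word w_i
x = np.zeros(V)
x[i] = 1.0                        # one-hot encoding of w_i

h = W.T @ x                       # hidden layer: h = W^T x

# Because x is one-hot, h is just the i-th row of W,
# that is, the vector representation v_{w_i} of the input word.
assert np.allclose(h, W[i])
```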

Next, we compute $u_j$, which implies a similarity score between the $j^{th}$ word in our vocabulary and the input target word. Similar to what we saw in the CBOW model, $u_j$ can be given as:

$$u_j = {W'_{(j)}}^{T} h$$

We can directly rewrite the above equation as:

$$u_j = {v'_{w_j}}^{T} h$$

Here, $v'_{w_j}$ implies the vector representation of the word $w_j$, that is, the $j^{th}$ column of the hidden-to-output layer weights $W'$.
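Continuing the sketch above, computing the scores is one more matrix multiplication (the hidden-to-output weights are again random, hypothetical values):

```python
# Hidden-to-output weights: one column per word in the vocabulary
W_prime = rng.standard_normal((N, V))

u = W_prime.T @ h                 # u[j] = v'_{w_j}^T h, a score for every word j

# The score of a single word j is the dot product between its output
# vector (the j-th column of W_prime) and the hidden layer h.
j = 7
assert np.allclose(u[j], W_prime[:, j] @ h)
```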

But, unlike the CBOW model, where we predicted just the one target word, here we are predicting $C$ context words. So, we can rewrite the above equation as:

$$u_{c,j} = {v'_{w_j}}^{T} h \quad \text{for } c = 1, 2, \ldots, C$$

Thus, $u_{c,j}$ implies the score for the $j^{th}$ word in the vocabulary to be the $c^{th}$ context word. That is:

  • $u_{1,j}$ implies the score for the word $w_j$ to be the first context word
  • $u_{2,j}$ implies the score for the word $w_j$ to be the second context word
  • $u_{3,j}$ implies the score for the word $w_j$ to be the third context word
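Since all $C$ output layers share the same hidden-to-output weights, the score vector is simply repeated for each context position. A small continuation of the sketch, with a hypothetical $C = 3$:

```python
# The C output layers share W_prime, so u_{c,j} is the same score vector
# repeated once per context position (C = 3 is chosen only for illustration).
C = 3
u_c = np.tile(u, (C, 1))   # shape (C, V): u_c[c, j] = score of word j as the c-th context word
```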

And since we want to convert our scores into probabilities, we apply the softmax function and compute $p(w_{c,j} \mid w_i)$:

$$\hat{y}_{c,j} = p(w_{c,j} \mid w_i) = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})}$$

Here, $\hat{y}_{c,j}$ implies the probability of the $j^{th}$ word in the vocabulary being the $c^{th}$ context word.
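In code, the softmax is applied row-wise over the vocabulary. This sketch continues the previous one; the max-subtraction is only for numerical stability and does not change the result:

```python
def softmax(scores):
    """Numerically stable softmax over the last (vocabulary) axis."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

y_hat = softmax(u_c)      # y_hat[c, j] = p(word j is the c-th context word | w_i)

# Each row is a valid probability distribution over the vocabulary.
assert np.allclose(y_hat.sum(axis=1), 1.0)
```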

Now, let us see how to compute the loss function. Let $p(w_{c,j^*} \mid w_i)$ denote the probability of the correct context word. So, we need to maximize this probability:

$$\max \; p(w_{c,j^*} \mid w_i)$$

Instead of maximizing the raw probability, we maximize the log probability:

$$\max \; \log p(w_{c,j^*} \mid w_i)$$

Similar to what we saw in the CBOW model, we convert this into a minimization objective function by adding the negative sign:

$$L = -\log p(w_{c,j^*} \mid w_i)$$
Substituting the softmax equation in the preceding equation, we can write the following:

$$L = -\log \frac{\exp(u_{c,j^*})}{\sum_{j'=1}^{V} \exp(u_{j'})}$$
Since we have $C$ context words, we take the product of their probabilities:

$$L = -\log \prod_{c=1}^{C} \frac{\exp(u_{c,j_c^*})}{\sum_{j'=1}^{V} \exp(u_{j'})}$$
So, according to the logarithm rules, we can rewrite the above equation, and our final loss function becomes:

$$L = -\sum_{c=1}^{C} u_{c,j_c^*} + C \log \sum_{j'=1}^{V} \exp(u_{j'})$$
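The following continuation of the sketch checks this logarithm-rule simplification numerically: the loss computed from the product of probabilities matches the simplified form. The indices of the correct context words are made up purely for illustration:

```python
# Hypothetical indices j_c^* of the correct context words, one per position c
true_context = np.array([1, 5, 2])

# Loss written directly as the negative log of the product of probabilities ...
L_product = -np.log(np.prod(y_hat[np.arange(C), true_context]))

# ... and the simplified form: L = -sum_c u_{c, j_c^*} + C * log(sum_j' exp(u_j'))
L_simplified = -u[true_context].sum() + C * np.log(np.exp(u).sum())

assert np.isclose(L_product, L_simplified)
```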
Look at the loss functions of the CBOW and skip-gram models. You'll notice that the only difference between the CBOW loss function and the skip-gram loss function is the addition of the context word index $c$, that is, the sum over the $C$ context words.
