First, we will understand how forward propagation works in the skip-gram model. Let's use the same notations we used in the CBOW model. The architecture of the skip-gram model is shown in the following figure. As you can see, we feed only one target word as an input, and it returns the context words as an output:
Similar to what we saw in CBOW, in the Forward propagation section, first we multiply our input $x$ with the input-to-hidden layer weights $W$:

$$h = W^T x$$

Since $x$ is a one-hot vector, this multiplication simply selects the row of $W$ corresponding to the input word, so we can directly rewrite the preceding equation as:

$$h = z_w$$

Here, $z_w$ implies the vector representation for the input word $w$.
Next, we compute $u_j$, which implies a similarity score between the $j^{th}$ word in our vocabulary and the input target word. Similar to what we saw in the CBOW model, $u_j$ can be given as:

$$u_j = {W'}_{(j)}^T h$$

We can directly rewrite the above equation as:

$$u_j = {z'}_{w_j}^T h$$

Here, ${z'}_{w_j}$ implies the vector representation of the word $w_j$, that is, the $j^{th}$ column of $W'$.
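The forward pass up to the scores can be sketched in a few lines of NumPy. The vocabulary size `V`, embedding size `N`, and the random weight matrices below are hypothetical toy values chosen only for illustration:

```python
import numpy as np

V, N = 5, 3                        # hypothetical vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # input-to-hidden weights
W_prime = rng.normal(size=(N, V))  # hidden-to-output weights

x = np.zeros(V)
x[2] = 1.0                         # one-hot vector for the input target word

h = W.T @ x                        # hidden layer: h = W^T x
# Because x is one-hot, h is simply row 2 of W, the input word's embedding z_w
assert np.allclose(h, W[2])

u = W_prime.T @ h                  # u[j]: similarity score between word j and the input word
```

Note that `u` has one score per vocabulary word, which is exactly the vector whose $j^{th}$ entry is $u_j$.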
But, unlike the CBOW model, where we just predicted the one target word, here we are predicting $C$ number of context words. So, we can rewrite the above equation as:

$$u_{c,j} = {z'}_{w_j}^T h \quad \text{for } c = 1, 2, \ldots, C$$

Thus, $u_{c,j}$ implies the score for the $j^{th}$ word in the vocabulary to be the $c^{th}$ context word. That is:

- $u_{1,j}$ implies the score for the $j^{th}$ word to be the first context word
- $u_{2,j}$ implies the score for the $j^{th}$ word to be the second context word
- $u_{3,j}$ implies the score for the $j^{th}$ word to be the third context word
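Since the same hidden layer and output weights are used for every context position, the score vector is identical for each of the $C$ positions. A minimal sketch, assuming toy sizes `V`, `N`, and `C`:

```python
import numpy as np

V, N, C = 5, 3, 2                  # hypothetical vocab size, embedding size, context words
rng = np.random.default_rng(1)
h = rng.normal(size=N)             # hidden layer (input word's embedding)
W_prime = rng.normal(size=(N, V))  # hidden-to-output weights

u = W_prime.T @ h                  # u[j] = u_j, one score per vocabulary word
u_c = np.tile(u, (C, 1))           # u_c[c, j] = u_{c,j}; every row is the same score vector
assert np.allclose(u_c[0], u_c[1])
```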
And since we want to convert our scores to probabilities, we apply the softmax function and compute $y_{c,j}$:

$$y_{c,j} = p(w_{c,j} \mid w) = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})}$$

Here, $y_{c,j}$ implies the probability of the $j^{th}$ word in the vocabulary to be the $c^{th}$ context word.
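The softmax step for one context position can be sketched as follows; the score values in `u` are made-up toy numbers:

```python
import numpy as np

u = np.array([2.0, 1.0, 0.1, -1.0, 0.5])  # hypothetical scores u_j for a vocabulary of V = 5
y = np.exp(u) / np.sum(np.exp(u))         # y_j = exp(u_j) / sum_j' exp(u_j')
assert np.isclose(y.sum(), 1.0)           # the scores are now a valid probability distribution
```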
Now, let us see how to compute the loss function. Let $y_{c,j_c^*}$ denote the probability of the correct context word, where $j_c^*$ is the index of the actual $c^{th}$ context word. So, we need to maximize this probability:

$$\max \; y_{c,j_c^*}$$

Instead of maximizing raw probabilities, we maximize the log probabilities:

$$\max \; \log y_{c,j_c^*}$$

Similar to what we saw in the CBOW model, we convert this into a minimization objective function by adding the negative sign:

$$L = -\log y_{c,j_c^*}$$
Substituting the softmax equation for $y_{c,j_c^*}$ in the preceding equation, we can write the following:

$$L = -\log \frac{\exp(u_{c,j_c^*})}{\sum_{j'=1}^{V} \exp(u_{j'})}$$

Since we have $C$ context words, we take the product of their probabilities:

$$L = -\log \prod_{c=1}^{C} \frac{\exp(u_{c,j_c^*})}{\sum_{j'=1}^{V} \exp(u_{j'})}$$

So, according to logarithm rules, we can rewrite the above equation and our final loss function becomes:

$$L = -\sum_{c=1}^{C} u_{c,j_c^*} + C \cdot \log \sum_{j'=1}^{V} \exp(u_{j'})$$
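We can check numerically that the product-of-probabilities form and the rearranged final form give the same loss. The scores and the correct-word indices `j_star` below are hypothetical toy values:

```python
import numpy as np

V, C = 5, 2                        # hypothetical vocabulary size and number of context words
rng = np.random.default_rng(2)
u = rng.normal(size=V)             # scores u_j (the same for every context position c)
j_star = [1, 3]                    # hypothetical indices of the C actual context words

# Loss as the negative log of the product of softmax probabilities
y = np.exp(u) / np.sum(np.exp(u))
loss_product = -np.log(np.prod([y[j] for j in j_star]))

# Loss in the rearranged final form: -sum_c u_{j*_c} + C * log sum_j' exp(u_j')
loss_final = -sum(u[j] for j in j_star) + C * np.log(np.sum(np.exp(u)))
assert np.isclose(loss_product, loss_final)
```

The equality holds because the log of a product of softmax terms splits into a sum of numerators minus $C$ copies of the shared log-denominator.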
Look at the loss functions of the CBOW and skip-gram models. You'll notice that the only difference between the CBOW loss function and the skip-gram loss function is the summation over the $C$ context words, indexed by $c$.