First, we will understand how forward propagation works in the skip-gram model. Let's use the same notations we used in the CBOW model. The architecture of the skip-gram model is shown in the following figure. As you can see, we feed only one target word as an input, and it returns the context words as an output:
Similar to what we saw in CBOW, in the Forward propagation section, first we multiply our input $x$ with the input-to-hidden layer weights $W$:

$$h = W^T x$$

Since $x$ is a one-hot vector, this multiplication simply selects the row of $W$ corresponding to the input word, so we can directly rewrite the preceding equation as:

$$h = z_w$$

Here, $z_w$ implies the vector representation for the input word $w$.
Next, we compute $u_j$, which implies a similarity score between the $j^{th}$ word in our vocabulary and the input target word. Similar to what we saw in the CBOW model, $u_j$ can be given as:

$$u_j = {W'}_{(j)}^T h$$

We can directly rewrite the above equation as:

$$u_j = {z'}_{w_j}^T h$$

Here, ${z'}_{w_j}$ implies the vector representation of the word $w_j$, that is, the $j^{th}$ column of $W'$.
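The forward pass up to the scores can be sketched in a few lines of NumPy. The vocabulary size `V`, embedding size `N`, and the random weight matrices below are hypothetical toy values chosen only for illustration:

```python
import numpy as np

V, N = 5, 3                        # hypothetical vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # input-to-hidden weights
W_prime = rng.normal(size=(N, V))  # hidden-to-output weights

x = np.zeros(V)
x[2] = 1.0                         # one-hot vector for the input target word

h = W.T @ x                        # hidden layer: h = W^T x
# Because x is one-hot, h is simply row 2 of W, the input word's embedding z_w
assert np.allclose(h, W[2])

u = W_prime.T @ h                  # u[j]: similarity score between word j and the input word
```

Note that `u` has one score per vocabulary word, which is exactly the vector whose $j^{th}$ entry is $u_j$.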
But, unlike the CBOW model, where we just predicted the one target word, here we are predicting $C$ number of context words. So, we can rewrite the above equation as:

$$u_{c,j} = {z'}_{w_j}^T h \quad \text{for } c = 1, 2, \ldots, C$$

Thus, $u_{c,j}$ implies the score for the $j^{th}$ word in the vocabulary to be the $c^{th}$ context word. That is:

- $u_{1,j}$ implies the score for the $j^{th}$ word to be the first context word
- $u_{2,j}$ implies the score for the $j^{th}$ word to be the second context word
- $u_{3,j}$ implies the score for the $j^{th}$ word to be the third context word
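Since the same hidden layer and output weights are used for every context position, the score vector is identical for each of the $C$ positions. A minimal sketch, assuming toy sizes `V`, `N`, and `C`:

```python
import numpy as np

V, N, C = 5, 3, 2                  # hypothetical vocab size, embedding size, context words
rng = np.random.default_rng(1)
h = rng.normal(size=N)             # hidden layer (input word's embedding)
W_prime = rng.normal(size=(N, V))  # hidden-to-output weights

u = W_prime.T @ h                  # u[j] = u_j, one score per vocabulary word
u_c = np.tile(u, (C, 1))           # u_c[c, j] = u_{c,j}; every row is the same score vector
assert np.allclose(u_c[0], u_c[1])
```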
And since we want to convert our scores to probabilities, we apply the softmax function and compute $y_{c,j}$:

$$y_{c,j} = p(w_{c,j} \mid w) = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})}$$

Here, $y_{c,j}$ implies the probability of the $j^{th}$ word in the vocabulary to be the $c^{th}$ context word.
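The softmax step for one context position can be sketched as follows; the score values in `u` are made-up toy numbers:

```python
import numpy as np

u = np.array([2.0, 1.0, 0.1, -1.0, 0.5])  # hypothetical scores u_j for a vocabulary of V = 5
y = np.exp(u) / np.sum(np.exp(u))         # y_j = exp(u_j) / sum_j' exp(u_j')
assert np.isclose(y.sum(), 1.0)           # the scores are now a valid probability distribution
```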
Now, let us see how to compute the loss function. Let $y_{c,j_c^*}$ denote the probability of the correct context word, where $j_c^*$ is the index of the actual $c^{th}$ context word. So, we need to maximize this probability:

$$\max \; y_{c,j_c^*}$$

Instead of maximizing raw probabilities, we maximize the log probabilities:

$$\max \; \log y_{c,j_c^*}$$

Similar to what we saw in the CBOW model, we convert this into a minimization objective function by adding the negative sign:

$$L = -\log y_{c,j_c^*}$$
Substituting the softmax equation for $y_{c,j_c^*}$ in the preceding equation, we can write the following:

$$L = -\log \frac{\exp(u_{c,j_c^*})}{\sum_{j'=1}^{V} \exp(u_{j'})}$$

Since we have $C$ context words, we take the product of their probabilities:

$$L = -\log \prod_{c=1}^{C} \frac{\exp(u_{c,j_c^*})}{\sum_{j'=1}^{V} \exp(u_{j'})}$$

So, according to logarithm rules, we can rewrite the above equation and our final loss function becomes:

$$L = -\sum_{c=1}^{C} u_{c,j_c^*} + C \cdot \log \sum_{j'=1}^{V} \exp(u_{j'})$$
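We can check numerically that the product-of-probabilities form and the rearranged final form give the same loss. The scores and the correct-word indices `j_star` below are hypothetical toy values:

```python
import numpy as np

V, C = 5, 2                        # hypothetical vocabulary size and number of context words
rng = np.random.default_rng(2)
u = rng.normal(size=V)             # scores u_j (the same for every context position c)
j_star = [1, 3]                    # hypothetical indices of the C actual context words

# Loss as the negative log of the product of softmax probabilities
y = np.exp(u) / np.sum(np.exp(u))
loss_product = -np.log(np.prod([y[j] for j in j_star]))

# Loss in the rearranged final form: -sum_c u_{j*_c} + C * log sum_j' exp(u_j')
loss_final = -sum(u[j] for j in j_star) + C * np.log(np.sum(np.exp(u)))
assert np.isclose(loss_product, loss_final)
```

The equality holds because the log of a product of softmax terms splits into a sum of numerators minus $C$ copies of the shared log-denominator.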
Look at the loss functions of the CBOW and skip-gram models. You'll notice that the only difference between the CBOW loss function and the skip-gram loss function is the summation over the $C$ context words, indexed by $c$.