Backward propagation

We minimize the loss function using the gradient descent algorithm. So, we backpropagate through the network, calculate the gradient of the loss function with respect to the weights, and update the weights according to the weight update rule.
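
Before computing the gradients, it helps to recall the quantities from the forward pass; the following is a brief sketch in standard skip-gram notation, where $x$ is the one-hot vector of the input word, $W$ and $W'$ are the input-to-hidden and hidden-to-output weight matrices, $h$ is the hidden layer, and $\hat{y}_c$ is the predicted distribution for the $c^{th}$ of the $C$ context words:

$$h = W^{T}x, \qquad u_c = W'^{T}h, \qquad \hat{y}_c = \text{softmax}(u_c)$$

The loss $L$ is the sum of the cross-entropy losses of all $C$ context words.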

First, we compute the gradient of the loss with respect to the hidden-to-output layer weight, $W'$. We cannot calculate the derivative of the loss with respect to $W'$ directly from $L$, as it has no $W'$ term in it, so we apply the chain rule, as shown below. It is basically the same as what we saw in the CBOW model, except that here we sum over all the context words:

$$\frac{\partial L}{\partial W'} = \sum_{c=1}^{C}\frac{\partial L}{\partial u_c}\cdot\frac{\partial u_c}{\partial W'}$$

First, let's compute the first term:

$$\frac{\partial L}{\partial u_c} = \hat{y}_c - y_c$$

We know that $\hat{y}_c - y_c$ is the error term, $e_c$, which is the difference between the predicted word and the actual word. For notational simplicity, we can write the sum of this error over all the context words as:

$$EI = \sum_{c=1}^{C} e_c$$

So, we can say that:

$$\sum_{c=1}^{C}\frac{\partial L}{\partial u_c} = \sum_{c=1}^{C} e_c = EI$$

Now, let's compute the second term. Since we know that $u_c = W'^{T}h$, we can write:

$$\frac{\partial u_c}{\partial W'} = h$$

Thus, the gradient of the loss with respect to $W'$ is given as follows:

$$\frac{\partial L}{\partial W'} = h \cdot EI^{T}$$

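The following is a minimal NumPy sketch of this computation, with toy dimensions and randomly filled placeholders; the names (`W_prime`, `y_hat`, `EI`, and so on) are illustrative rather than taken from a particular code listing:

```python
import numpy as np

np.random.seed(0)
V, N, C = 10, 5, 2                # vocabulary size, embedding size, number of context words

h = np.random.randn(N)            # hidden layer vector (W.T @ x in the full model; random here)
W_prime = np.random.randn(N, V)   # hidden-to-output weights

u = W_prime.T @ h                 # scores over the vocabulary (shared by every context position)
y_hat = np.exp(u) / np.exp(u).sum()   # softmax prediction

context_ids = [3, 7]              # indices of the C actual context words (arbitrary here)
y = np.zeros((C, V))
y[np.arange(C), context_ids] = 1.0    # one-hot targets, one row per context word

e = y_hat - y                     # error term e_c for each context word, shape (C, V)
EI = e.sum(axis=0)                # error summed over all context words, shape (V,)

dW_prime = np.outer(h, EI)        # gradient of the loss w.r.t. W', shape (N, V)
```
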
Now, we compute the gradient of the loss with respect to the input-to-hidden layer weight, $W$. It is simple and exactly the same as what we saw in the CBOW model:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial h}\cdot\frac{\partial h}{\partial W}$$

Thus, the gradient of the loss with respect to $W$ is given as:

$$\frac{\partial L}{\partial W} = x \cdot \left(W'\,EI\right)^{T}$$

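A matching sketch for this step, under the same toy assumptions (here `x` is the one-hot vector of the input word, and `EH` denotes the error propagated back to the hidden layer):

```python
import numpy as np

np.random.seed(0)
V, N = 10, 5

x = np.zeros(V)
x[1] = 1.0                        # one-hot vector of the input (center) word

W_prime = np.random.randn(N, V)   # hidden-to-output weights
EI = np.random.randn(V)           # summed output-layer error (random placeholder here)

EH = W_prime @ EI                 # error propagated back to the hidden layer, shape (N,)
dW = np.outer(x, EH)              # gradient of the loss w.r.t. W, shape (V, N)
```
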
After computing the gradients, we update our weights $W$ and $W'$ as:

$$W = W - \alpha\frac{\partial L}{\partial W}$$

$$W' = W' - \alpha\frac{\partial L}{\partial W'}$$

Here, $\alpha$ is the learning rate.

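A minimal sketch of this update step in NumPy, with `lr` as an assumed learning rate and the gradients filled with placeholder values:

```python
import numpy as np

np.random.seed(0)
V, N = 10, 5
lr = 0.01                          # learning rate (alpha)

W = np.random.randn(V, N)          # input-to-hidden weights
W_prime = np.random.randn(N, V)    # hidden-to-output weights
dW = np.random.randn(V, N)         # placeholder for the gradient of the loss w.r.t. W
dW_prime = np.random.randn(N, V)   # placeholder for the gradient of the loss w.r.t. W'

# weight update rule
W -= lr * dW
W_prime -= lr * dW_prime
```
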
Thus, while training the network, we update the weights using the preceding equations and obtain the optimal weights. The optimal weight matrix between the input and the hidden layer, $W$, then becomes the vector representation of the words in our vocabulary.
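For example, once training is done, looking up a word's vector is just a row lookup in $W$; the vocabulary mapping below is a hypothetical illustration:

```python
import numpy as np

np.random.seed(0)
V, N = 10, 5
W = np.random.randn(V, N)              # trained input-to-hidden weights (random here for illustration)

word_to_index = {"dog": 1}             # hypothetical word-to-index mapping
word_vector = W[word_to_index["dog"]]  # the N-dimensional embedding of "dog"
print(word_vector.shape)               # (5,)
```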
