A practical example of the skip-gram architecture

Let's go through a practical example and see how the skip-gram model works on the following sentence:

the quick brown fox jumped over the lazy dog

First off, we need to build a dataset of words and their corresponding contexts. Defining the context is up to us, as long as it makes sense, so we'll take a window around the target word: one word from the left and one from the right.

By following this windowing technique, we will end up with the following set of words and their corresponding contexts:

([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...
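
To make the windowing step concrete, here is a minimal sketch in plain Python (not taken from the book's code) that builds the (context, target) pairs with a window of one word on each side:

```python
# Build (context, target) pairs with a window of one word on each side
# of the target word.
sentence = "the quick brown fox jumped over the lazy dog".split()

context_target_pairs = []
for i in range(1, len(sentence) - 1):
    context = [sentence[i - 1], sentence[i + 1]]  # one word to the left, one to the right
    target = sentence[i]
    context_target_pairs.append((context, target))

print(context_target_pairs[:3])
# [(['the', 'brown'], 'quick'), (['quick', 'fox'], 'brown'), (['brown', 'jumped'], 'fox')]
```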

The generated words and their corresponding contexts will be represented as pairs of (context, target). The idea of the skip-gram model is the inverse of the CBOW one: in the skip-gram model, we try to predict the context words from the target word. For example, considering the first pair, the skip-gram model will try to predict the words the and brown from the target word quick, and so on. So, we can rewrite our dataset as follows:

(quick, the), (quick, brown), (brown, quick), (brown, fox), ...

Now, we have a set of input and output pairs.
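
A minimal sketch of this flip, again in plain Python and under the same assumptions as the previous snippet, would look like this:

```python
# Flip each (context, target) pair into skip-gram (input, output) pairs:
# the input is the target word, the output is one of its context words.
sentence = "the quick brown fox jumped over the lazy dog".split()

skip_gram_pairs = []
for i in range(1, len(sentence) - 1):
    target = sentence[i]
    for context_word in (sentence[i - 1], sentence[i + 1]):
        skip_gram_pairs.append((target, context_word))

print(skip_gram_pairs[:4])
# [('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox')]
```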

Let's try to mimic the training process at a specific step, t. The skip-gram model takes the first training sample, where the input is the word quick and the target output is the word the. Next, we need to construct a noisy input as well, so we randomly select words from the unigram distribution of the input data. For simplicity, we will use only one noisy example per step. For example, we can select the word sheep as the noisy example.
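
Here is a minimal sketch of how such a noisy example could be drawn from the unigram distribution; the vocabulary, counts, and the sample_noise_word helper below are hypothetical stand-ins, with sheep included only to mirror the example above:

```python
import random

# Hypothetical unigram counts standing in for a real corpus vocabulary.
unigram_counts = {"the": 2, "quick": 1, "brown": 1, "fox": 1, "jumped": 1,
                  "over": 1, "lazy": 1, "dog": 1, "sheep": 1}

def sample_noise_word(true_context_word, counts):
    """Draw one noisy word in proportion to its unigram count,
    making sure it is not the true context word."""
    words = list(counts)
    weights = list(counts.values())
    while True:
        candidate = random.choices(words, weights=weights, k=1)[0]
        if candidate != true_context_word:
            return candidate

# For the training sample (quick -> the), draw one noise word, e.g. 'sheep'.
noise_word = sample_noise_word("the", unigram_counts)
```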

Now, we can go ahead and compute the loss between the real pair and the noisy one. At step t, the objective we want to maximize is:

J_NEG^(t) = log Q_θ(D = 1 | the, quick) + log Q_θ(D = 0 | sheep, quick)

Here, Q_θ(D = 1 | w, h) is the model's probability, computed from the learned embedding vectors θ, that the word w is a real context word of the input word h, and Q_θ(D = 0 | w, h) = 1 - Q_θ(D = 1 | w, h) is the probability that it is a noise word. The objective is high when the model assigns a high probability to the real pair and a low probability to the noisy one.

The goal in this case is to update the parameters θ to improve the previous objective function. Typically, we use gradient-based optimization for this, so we calculate the gradient of the objective with respect to its parameters θ, which is represented by ∂/∂θ J_NEG^(t), and then move the parameters a small step in the direction that improves the objective.
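
To make this step concrete, here is a minimal, self-contained NumPy sketch (a toy stand-in, not the actual implementation) of a single training step for the pair (quick, the) with the noise word sheep; the embedding matrices input_emb and output_emb play the role of the parameters θ, and train_step is a hypothetical helper:

```python
import numpy as np

# Toy setup: a tiny vocabulary and random input/output embedding matrices.
# These matrices are the parameters theta that training updates.
rng = np.random.default_rng(0)
vocab = ["the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog", "sheep"]
word_to_id = {w: i for i, w in enumerate(vocab)}
dim = 8

input_emb = rng.normal(scale=0.1, size=(len(vocab), dim))   # vectors for input (target) words
output_emb = rng.normal(scale=0.1, size=(len(vocab), dim))  # vectors for context/noise words

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(input_word, true_context, noise_word, lr=0.1):
    v_in = input_emb[word_to_id[input_word]]
    v_true = output_emb[word_to_id[true_context]]
    v_noise = output_emb[word_to_id[noise_word]]

    s_true = sigmoid(v_true @ v_in)    # Q_theta(D = 1 | true context, input)
    s_noise = sigmoid(v_noise @ v_in)  # Q_theta(D = 1 | noise word, input)
    objective = np.log(s_true) + np.log(1.0 - s_noise)

    # Gradients of the objective with respect to the three vectors involved.
    grad_in = (1.0 - s_true) * v_true - s_noise * v_noise
    grad_true = (1.0 - s_true) * v_in
    grad_noise = -s_noise * v_in

    # Gradient ascent: move the parameters in the direction that increases
    # the objective (equivalently, decreases the loss).
    input_emb[word_to_id[input_word]] += lr * grad_in
    output_emb[word_to_id[true_context]] += lr * grad_true
    output_emb[word_to_id[noise_word]] += lr * grad_noise
    return objective

print(train_step("quick", "the", "sheep"))
```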

After the training process, we can visualize some results by reducing the dimensionality of the learned real-valued vector representations. You will find that this vector space is very interesting because you can do lots of interesting things with it. For example, you can learn analogies in this space, such as king is to queen as man is to woman. We can even derive the woman vector by subtracting the king vector from the queen vector and adding the man vector; the result will be very close to the actual learned vector for woman. You can also learn geographical relationships in this space.
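
As a small illustration of this vector arithmetic, here is a sketch in which the vectors are random placeholders standing in for trained embeddings, so the similarity score is only meaningful once real learned vectors are plugged in:

```python
import numpy as np

# Placeholder embeddings; in practice these would come from a trained model.
rng = np.random.default_rng(1)
embeddings = {w: rng.normal(size=8) for w in ["king", "queen", "man", "woman"]}

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# queen - king + man should land close to the learned vector for woman.
derived_woman = embeddings["queen"] - embeddings["king"] + embeddings["man"]
print(cosine_similarity(derived_woman, embeddings["woman"]))
```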

Figure 15.7: Projection of the learned vectors to two dimensions using t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction technique

The preceding example gives a very good intuition about these vectors and how useful they are for most NLP applications, such as machine translation or part-of-speech (POS) tagging.
