Building the Word2Vec model

In this section, we will go through the details of how we can build a Word2Vec model. As we mentioned previously, our final goal is to have a trained model that is able to generate real-valued vector representations for the input textual data, which are also called word embeddings.

During the training of the model, we will use the maximum likelihood method (https://en.wikipedia.org/wiki/Maximum_likelihood) to maximize the probability of the next word wt in the input sentence, given the previous words that the model has seen, which we can call h.

This maximum likelihood method will be expressed in terms of the softmax function:
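
Sketching it out with the score function and vocabulary V described in the following paragraphs, the softmax form is:

P(w_t \mid h) = \mathrm{softmax}\big(\mathrm{score}(w_t, h)\big) = \frac{\exp\{\mathrm{score}(w_t, h)\}}{\sum_{w' \in V} \exp\{\mathrm{score}(w', h)\}}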

Here, the score function computes a value that represents the compatibility of the target word wt with the context h. The model is trained on the input sequences to maximize the likelihood of the training data (the log likelihood is used for mathematical simplicity and easier derivation):
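
Using the same notation, this log-likelihood objective can be sketched as:

J_{\mathrm{ML}} = \log P(w_t \mid h) = \mathrm{score}(w_t, h) - \log\left(\sum_{w' \in V} \exp\{\mathrm{score}(w', h)\}\right)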

So, the maximum likelihood method will try to maximize the preceding equation, which will result in a probabilistic language model. However, this calculation is very computationally expensive, because we need to compute each probability using the score function for every other word w' in the vocabulary V, in the current context h of the model. This happens at every training step.
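
To make this cost concrete, the following NumPy sketch (the vocabulary size, embedding dimensionality, and random values are purely illustrative) shows that the normalizing denominator touches every word in the vocabulary for a single training example:

import numpy as np

vocabulary_size = 50000   # illustrative vocabulary size |V|
embedding_size = 128      # illustrative embedding dimensionality

# Illustrative context representation h and per-word output parameters.
h = np.random.randn(embedding_size)
output_weights = np.random.randn(vocabulary_size, embedding_size)

# score(w', h) has to be computed for EVERY word w' in the vocabulary...
scores = output_weights @ h

# ...just to normalize the probability of a single target word wt.
target = 42  # arbitrary index of the target word wt
p_target = np.exp(scores[target]) / np.exp(scores).sum()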

Figure 15.4: General architecture of a probabilistic language model

Because building the full probabilistic language model is so computationally expensive, people tend to use less expensive techniques, such as the Continuous Bag-of-Words (CBOW) and skip-gram models.

These models are trained with a binary classification objective (logistic regression) to separate the real target words wt from k imaginary or noise words drawn in the same context. The following diagram simplifies this idea using the CBOW technique:

Figure 15.5: General architecture of skip-gram model

The next diagram shows the two architectures that you can use for building the Word2Vec model:

Figure 15.6: Different architectures for the Word2Vec model
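
As a rough illustration of the difference between the two architectures (the example sentence and window size here are arbitrary, and a real pipeline would work on word indices rather than strings), skip-gram generates (center word, context word) pairs, while CBOW pairs each center word with its whole surrounding context:

# Illustrative training pairs for a window size of 1 around each center word.
sentence = "the quick brown fox jumps".split()
window = 1

skip_gram_pairs = []  # skip-gram: (input = center word, target = one context word)
cbow_pairs = []       # CBOW: (input = list of context words, target = center word)

for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    for context_word in context:
        skip_gram_pairs.append((center, context_word))
    cbow_pairs.append((context, center))

print(skip_gram_pairs[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
print(cbow_pairs[1])        # (['the', 'brown'], 'quick')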

To be more formal, the objective function of these techniques maximizes the following:
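
One common way of writing this objective is the noise-contrastive form (here, k is the number of noise words w̃ drawn from a noise distribution Pnoise; this sketch follows that standard formulation):

J_{\mathrm{NEG}} = \log Q_\theta(D = 1 \mid w_t, h) + k\, \mathbb{E}_{\tilde{w} \sim P_{\mathrm{noise}}}\left[\log Q_\theta(D = 0 \mid \tilde{w}, h)\right]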

Where:

  • Qθ(D = 1 | wt, h) is the probability, given by the binary logistic regression model, of seeing the word wt in the context h in the dataset D, which is calculated in terms of the θ vector. This vector represents the learned embeddings.
  • w̃ represents the imaginary or noise words that we can generate from a noise distribution, such as the unigram distribution of the training input examples.

To sum up, the objective of these models is to discriminate between real and imaginary inputs: it is maximized when the model assigns high probabilities to real words and low probabilities to imaginary or noise words.

Technically, this process of assigning high probability to real words and low probability to noise words is called negative sampling (https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf), and there is a good mathematical motivation for using this loss function: the updates it proposes approximate the updates of the softmax function in the limit. It is also computationally appealing, because computing the loss function now scales only with the number of noise words that we select (k), and not with all the words in the vocabulary (V). This makes it much faster to train. We will actually make use of the very similar noise-contrastive estimation (NCE) (https://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf) loss, for which TensorFlow has a handy helper function, tf.nn.nce_loss().
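
The following is a minimal sketch of how tf.nn.nce_loss() is typically wired into such a model, written against the TensorFlow 1.x API; the vocabulary size, embedding dimensionality, batch size, and number of sampled noise words are illustrative choices:

import tensorflow as tf  # written against the TensorFlow 1.x API

vocabulary_size = 50000   # illustrative vocabulary size
embedding_size = 128      # illustrative embedding dimensionality
batch_size = 128          # illustrative batch size
num_sampled = 64          # illustrative number of noise words k per batch

# Placeholders for (input word, target word) index pairs.
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

# The embedding matrix: one real-valued vector per vocabulary word.
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# Weights and biases of the binary (NCE) classifier over the vocabulary.
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / embedding_size ** 0.5))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# The NCE loss compares each true target word against num_sampled noise
# words instead of normalizing over the whole vocabulary.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)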