In this section, we will go through some deeper details of how we can build a Word2Vec model. As we mentioned previously, our final goal is to have a trained model that is able to generate real-valued vector representations, also called word embeddings, for the input textual data.
During the training of the model, we will use the maximum likelihood method (https://en.wikipedia.org/wiki/Maximum_likelihood), which maximizes the probability of the next word wt in the input sentence given the previous words that the model has seen, which we can call h.
This maximum likelihood method is expressed in terms of the softmax function:

P(wt | h) = softmax(score(wt, h)) = exp(score(wt, h)) / Σ_{w' in V} exp(score(w', h))
Here, the score function computes a value that represents the compatibility of the target word wt with the context h. The model is trained to maximize this likelihood on the training input data (the log likelihood is used for mathematical simplicity and easier derivation):

J_ML = log P(wt | h) = score(wt, h) - log( Σ_{w' in V} exp(score(w', h)) )
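As an illustrative sketch of this computation, the following NumPy snippet assumes that score(w, h) is a simple dot product between the context representation h and each word's output embedding (all names and sizes here are illustrative, not part of any specific implementation):

```python
import numpy as np

vocab_size, dim = 10000, 128
rng = np.random.default_rng(0)
output_embeddings = rng.normal(size=(vocab_size, dim))  # one row per word w' in V
h = rng.normal(size=dim)                                # context representation

# score(w', h) for every word w' in the vocabulary, assumed to be a dot product
scores = output_embeddings @ h

# softmax over the whole vocabulary (shifted by max for numerical stability)
probs = np.exp(scores - scores.max())
probs /= probs.sum()

target = 42                                 # hypothetical index of the target word wt
log_likelihood = np.log(probs[target])      # the quantity maximum likelihood maximizes
```

Note that the normalization step touches every row of output_embeddings, which is exactly the per-step cost discussed next.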
So, the ML method will try to maximize the above equation, which results in a probabilistic language model. But this calculation is very computationally expensive: at every training step, we need to compute the score function for every word w' in the vocabulary V, in the corresponding current context h, in order to normalize the probability.
Because of the computational expense of building the full probabilistic language model, people tend to use less computationally expensive techniques, such as the Continuous Bag-of-Words (CBOW) and skip-gram models.
These models are trained as binary classifiers, using logistic regression to separate the real target word wt from noise or imaginary words drawn from the same context h. The following diagram simplifies this idea using the CBOW technique:
The next diagram shows the two architectures that you can use for building the Word2Vec model:
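As a small illustration of how the two architectures frame the prediction task, the following sketch builds training pairs from a toy sentence with a window size of 1: CBOW predicts the target word from its surrounding context, while skip-gram predicts each context word from the target (the sentence and variable names are purely illustrative):

```python
sentence = ["the", "cat", "sat", "on", "the", "mat"]
window = 1

cbow_pairs = []      # (context words, target word): context predicts target
skipgram_pairs = []  # (target word, context word): target predicts each context word
for i, target in enumerate(sentence):
    # words within `window` positions on either side of the target
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    cbow_pairs.append((tuple(context), target))
    for c in context:
        skipgram_pairs.append((target, c))
```

For the word "cat", CBOW produces the single example (("the", "sat"), "cat"), whereas skip-gram produces the two examples ("cat", "the") and ("cat", "sat").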
To be more formal, the objective function of these techniques maximizes the following:

J_NEG = log Q_θ(D=1 | wt, h) + k E_{w̃ ~ P_noise}[ log Q_θ(D=0 | w̃, h) ]
Where:
- Q_θ(D=1 | w, h) is the probability, under the binary logistic regression model, of seeing the word w in the context h in the dataset D, calculated in terms of the θ vector. This vector represents the learned embeddings.
- w̃ represents the imaginary or noise words that we can generate from a noise probability distribution, such as the unigram distribution of the training input examples.
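A minimal sketch of this objective for a single training step, assuming the model's scores are dot products between illustrative embedding vectors and the context representation (the vectors, dimensions, and the choice of k = 5 here are assumptions for the sake of the example):

```python
import numpy as np

def sigmoid(x):
    # logistic function: maps a score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
dim = 64
h = rng.normal(size=dim)                 # context representation
w_target = rng.normal(size=dim)          # embedding of the real target word wt
noise_words = rng.normal(size=(5, dim))  # k = 5 words sampled from the noise distribution

# log Q(D=1 | wt, h): reward a high probability for the real (word, context) pair
objective = np.log(sigmoid(w_target @ h))
# sum over noise words of log Q(D=0 | w~, h): reward low probabilities for noise pairs
objective += np.log(sigmoid(-(noise_words @ h))).sum()
```

Gradient ascent on this quantity only ever touches the target word, the k noise words, and the context, rather than the whole vocabulary, which is what makes it so much cheaper than the full softmax.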
To sum up, the objective of these models is to discriminate between real and imaginary inputs: it is maximized when the model assigns high probabilities to real words and low probabilities to noise or imaginary words.