Word2vec model

This model was created by Google in 2013 and is a predictive, deep learning-based model that computes and generates high-quality, distributed, and continuous dense vector representations of words, which capture contextual and semantic similarity. Essentially, these are unsupervised models that can take in massive textual corpora, create a vocabulary of possible words, and generate dense word embeddings for each word in the vector space representing that vocabulary. Usually, you can specify the size of the word embedding vectors, and the total number of vectors is essentially the size of the vocabulary. This makes the dimensionality of this dense vector space much lower than that of the high-dimensional sparse vector space built using traditional BoW models.
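In practice, these models are usually trained with an off-the-shelf implementation rather than from scratch. The following is a minimal sketch using the gensim library (an assumption here; gensim is not discussed in this section), showing how the embedding size is specified and how one dense vector per vocabulary word is obtained. The toy corpus and all parameter values are purely illustrative.

```python
# A minimal sketch, assuming gensim 4.x is installed (pip install gensim).
from gensim.models import Word2Vec

# Illustrative toy corpus: a list of tokenized sentences
corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the lazy dog sleeps all day".split(),
]

# vector_size controls the dimensionality of the dense word embeddings;
# the number of vectors learned equals the size of the vocabulary.
# sg=0 selects the CBOW architecture, sg=1 would select skip-gram.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["fox"].shape)         # (50,) -- one dense vector per word
print(model.wv.most_similar("dog"))  # contextually/semantically similar words
```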

There are two different model architectures that can be leveraged by Word2vec to create these word embedding representations. These are:

  • The Continuous Bag of Words (CBOW) model
  • The Skip-gram model

The CBOW model architecture tries to predict the current target word (the center word) based on the source context words (surrounding words). Considering a simple sentence, the quick brown fox jumps over the lazy dog, the training data can be pairs of (context_window, target_word), where, if we consider a context window of size 2, we have examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy), and so on. Thus, the model tries to predict the target word based on the context window words. The Word2vec family of models is unsupervised; this means that you can just give it a corpus without additional labels or information and it can construct dense word embeddings from the corpus. You will still need to leverage a supervised classification methodology to get to these embeddings, but that is done from within the corpus itself, without any auxiliary information. We can model this CBOW architecture as a deep learning classification model, such that we take in the context words as our input, X, and try to predict the target word, Y. In fact, building this architecture is simpler than the skip-gram model, where we try to predict a whole bunch of context words from a source target word.
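Before a classifier can be trained, the corpus has to be turned into these (context_window, target_word) pairs. The following is a minimal sketch in plain Python of that pair generation, assuming, as the examples above do, that a context window of size 2 means one word on either side of the target:

```python
# Generate (context_window, target_word) pairs for the CBOW model
# from the example sentence, with one context word on each side.
sentence = "the quick brown fox jumps over the lazy dog".split()

pairs = []
for i, target in enumerate(sentence):
    # Context = the word immediately before and immediately after the target
    context = sentence[max(0, i - 1):i] + sentence[i + 1:i + 2]
    pairs.append((context, target))

print(pairs[2])  # (['quick', 'fox'], 'brown')
print(pairs[1])  # (['the', 'brown'], 'quick')
```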

The skip-gram model architecture usually tries to achieve the reverse of what the CBOW model does. It tries to predict the source context words (surrounding words) given a target word (the center word). Consider our simple sentence from earlier, the quick brown fox jumps over the lazy dog. If we use the CBOW model, we get pairs of (context_window, target_word), where, if we consider a context window of size 2, we have examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy), and so on. Now, considering that the skip-gram model's aim is to predict the context from the target word, the model typically inverts the contexts and targets, and tries to predict each context word from its target word.

Hence, the task becomes to predict the context [quick, fox] given the target word brown or [the, brown] given the target word quick, and so on. Thus, the model tries to predict the context_window words based on the target_word.
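The inverted pair generation for the skip-gram model can be sketched in the same way. The following is a minimal, plain-Python illustration of the (target_word, context_word) pairs described above, again assuming one context word on either side of the target:

```python
# Generate (target_word, context_word) pairs for the skip-gram model:
# each context word is paired individually with its target (center) word.
sentence = "the quick brown fox jumps over the lazy dog".split()

skip_gram_pairs = []
for i, target in enumerate(sentence):
    context = sentence[max(0, i - 1):i] + sentence[i + 1:i + 2]
    for context_word in context:
        skip_gram_pairs.append((target, context_word))

print(skip_gram_pairs[:4])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]
```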

Following is the architecture diagram of the preceding two models:

An implementation of these models in Keras can be found in the following blog post by one of us: https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa.
