Skip-gram Word2Vec implementation

After covering the mathematical details of how skip-gram models work, we are going to implement skip-gram, which encodes words into real-valued vectors with useful properties (hence the name Word2Vec). By implementing this architecture, you will get a feel for how learning such an alternative representation actually works.

Text is the main input for many natural language processing applications, such as machine translation, sentiment analysis, and text-to-speech systems. Learning a real-valued representation of text therefore lets us apply different deep learning techniques to these tasks.

In the early chapters of this book, we introduced one-hot encoding, which produces a vector that is all zeros except at the index of the word the vector represents. So, you may wonder why we are not using it here. This approach is very inefficient because you usually have a large vocabulary of distinct words, perhaps around 50,000, and one-hot encoding such a vocabulary produces a vector with 49,999 entries set to zero and only one entry set to one.
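To make the sparsity concrete, here is a minimal NumPy sketch; the vocabulary size of 50,000 and the word index 103 are made-up values for illustration:

```python
import numpy as np

vocab_size = 50000          # assumed vocabulary size for illustration
word_index = 103            # hypothetical integer ID of the word to encode

# Build the one-hot vector: all zeros except a single one at word_index
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

print(one_hot.sum())        # 1.0 -> only one non-zero entry out of 50,000
```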

Having a very sparse input like this will result in a huge waste of computation because of the matrix multiplications that we'd do in the hidden layers of the neural network.

Figure 15.8: One-hot encoding, which results in a huge waste of computation

As we mentioned previously, the outcome of using one-hot encoding is a very sparse vector, especially when you have a huge number of distinct words that you want to encode.

The following figure shows that when we multiply this sparse vector (all zeros except for one entry) by a matrix of weights, the output is just the row of the matrix that corresponds to the single one in the sparse vector:

Figure 15.9: The effect of multiplying a one-hot vector (almost all zeros) by the hidden layer weight matrix
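The following is a minimal NumPy sketch (with made-up sizes: a 10-word vocabulary and a 3-dimensional embedding) confirming that multiplying a one-hot vector by a weight matrix simply selects one row of that matrix:

```python
import numpy as np

vocab_size, embed_dim = 10, 3            # small sizes for illustration
weights = np.random.randn(vocab_size, embed_dim)

word_index = 4
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Full matrix multiplication: wasteful, mostly multiplying by zeros
product = one_hot @ weights

# Direct row lookup: the same result with no multiplication at all
lookup = weights[word_index]

print(np.allclose(product, lookup))      # True
```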

To avoid this huge waste of computation, we will use embeddings, which are just a fully connected layer with a set of embedding weights. In this layer, we skip the inefficient multiplication and instead look up the embedding weights directly from a weight matrix.

So, instead of wasting computation, we are going to use a lookup into this weight matrix to find the embedding weights. First, we need to build this lookup table. To do this, we encode all the input words as integers, as shown in the following figure, and then, to get the corresponding values for a word, we use its integer representation as the row number in the weight matrix. The process of finding the corresponding embedding values of a specific word is called an embedding lookup. As mentioned previously, the embedding layer is just a fully connected layer, where the number of units represents the embedding dimension.

Figure 15.10: Tokenized lookup table
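As a concrete illustration of this lookup table, here is a small sketch in NumPy; the vocabulary, sentence, and embedding size are made up. It encodes words as integers and then uses those integers as row indices into the weight matrix:

```python
import numpy as np

# Hypothetical vocabulary: word -> integer ID
word_to_int = {"the": 0, "quick": 1, "brown": 2, "fox": 3}
embed_dim = 4

# The lookup table is just a weight matrix with one row per word
embedding_weights = np.random.randn(len(word_to_int), embed_dim)

# Integer-encode a sentence, then look up each word's embedding row
sentence = ["the", "quick", "brown", "fox"]
word_ids = [word_to_int[w] for w in sentence]
embedded = embedding_weights[word_ids]    # shape: (4, embed_dim)

print(embedded.shape)                     # (4, 4)
```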

You can see that this process is very intuitive and straightforward; we just need to follow these steps:

  1. Define the lookup table that will serve as the weight matrix
  2. Define the embedding layer as a fully connected hidden layer with a specific number of units (the embedding dimension)
  3. Use the weight matrix lookup as an alternative to the computationally unnecessary matrix multiplication
  4. Finally, train the lookup table just like any other weight matrix, as shown in the sketch after this list
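Here is a minimal sketch of these four steps, assuming a recent version of TensorFlow is available; the sizes and word IDs are illustrative, not part of the full implementation that follows:

```python
import tensorflow as tf

vocab_size, embed_dim = 50000, 300        # assumed sizes for illustration

# Step 1: the lookup table is a trainable weight matrix
embedding = tf.Variable(
    tf.random.uniform((vocab_size, embed_dim), -1.0, 1.0))

# Steps 2 and 3: the "embedding layer" is just a row lookup into that
# matrix, replacing the wasteful one-hot matrix multiplication
word_ids = tf.constant([12, 7, 391])      # hypothetical integer-encoded words
embed = tf.nn.embedding_lookup(embedding, word_ids)

# Step 4: because `embedding` is a tf.Variable, its rows receive
# gradients and are trained like any other weight matrix
print(embed.shape)                        # (3, 300)
```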

As we mentioned earlier, we are going to build a skip-gram Word2Vec model in this section, which is an efficient way of learning representations for words while preserving their semantic information.

So, let's go ahead and build a Word2Vec model using the skip-gram architecture, which has been shown to perform better than other approaches.
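Before the full implementation, here is a minimal sketch of the core idea behind skip-gram training data: for each center word, the surrounding words inside a window become the targets to predict. The sentence, the helper function name, and the window size are made up for illustration:

```python
def skip_gram_pairs(words, window_size=2):
    """Yield (center, context) training pairs for skip-gram."""
    pairs = []
    for i, center in enumerate(words):
        # Collect the words within `window_size` positions of the center word
        start = max(0, i - window_size)
        end = min(len(words), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skip_gram_pairs(sentence, window_size=2)[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'),
#  ('quick', 'brown'), ('quick', 'fox')]
```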
