Doc2vec

A simple extension of the Word2vec model, applied at the document level, was proposed by Mikolov et al. In this method, in order to obtain document vectors, a unique document ID token is prepended to each document and trained along with the document's words; its vector is averaged (or concatenated) with the word embeddings to produce a document embedding. Hence, in the example that we discussed earlier, the doc2vec model data would look as follows:

  • TensorFlow is an open source software library
  • Python is an open source interpreted software programming language

In contrast to the earlier approach, the token lists for each document now look as follows:

  • [DOC_01, TensorFlow, is, an, open, source, software, library]
  • [DOC_02, Python, is, an, open, source, interpreted, software, programming, language]
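The preparation step above can be sketched in plain Python. This is an illustrative helper (the function name and the `DOC_xx` tag format are assumptions for the example), showing how each sentence is tokenized and a unique document ID is prepended:

```python
# Build doc2vec-style training lists by prepending a unique
# document ID tag to each tokenized sentence.
def tag_documents(sentences):
    tagged = []
    for i, sentence in enumerate(sentences, start=1):
        tokens = sentence.split()
        tagged.append([f"DOC_{i:02d}"] + tokens)
    return tagged

corpus = [
    "TensorFlow is an open source software library",
    "Python is an open source interpreted software programming language",
]
docs = tag_documents(corpus)
# docs[0] -> ['DOC_01', 'TensorFlow', 'is', 'an', 'open',
#             'source', 'software', 'library']
```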

This doc2vec model looks very similar to the CBOW approach that we discussed earlier: the document vector, D, for every document is trained simultaneously with the word vectors, W. As a result, both the word vectors and the document vectors are accessible to us when training is complete. The reason this method is similar to CBOW is that the model tries to predict the target word, given the context words. Hence, in the example, DOC_01, TensorFlow, is, and an are used to predict open. This model is called the Distributed Memory model of Paragraph Vectors (PV-DM). The idea behind a document vector is to represent the semantic context of the topic that is discussed in the document.
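The PV-DM training pairs can be made concrete with a short sketch. This is a simplified illustration (the function name and the one-sided window of preceding words are assumptions for the example; real implementations typically use a symmetric window): the document tag plus a window of context words jointly predict the next word.

```python
# Generate simplified PV-DM training pairs: the document tag plus
# a window of preceding words jointly predict the target word.
def pv_dm_pairs(tagged_doc, window=3):
    doc_id, words = tagged_doc[0], tagged_doc[1:]
    pairs = []
    for i in range(window, len(words)):
        context = [doc_id] + words[i - window:i]
        pairs.append((context, words[i]))
    return pairs

doc = ["DOC_01", "TensorFlow", "is", "an", "open",
       "source", "software", "library"]
pairs = pv_dm_pairs(doc)
# first pair: (['DOC_01', 'TensorFlow', 'is', 'an'], 'open')
```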

Similar to Word2vec, a variant of the PV-DM model is the Paragraph Vector – Distributed Bag of Words (PV-DBOW) model, which is analogous to the skip-gram model of Word2vec. In this version, the model predicts the target words given only the document vector. For instance, DOC_01 is used to predict the words TensorFlow, is, an, and open. One potential advantage that PV-DBOW has over PV-DM is that the word vectors do not need to be stored.
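The PV-DBOW objective can be sketched the same way (again, the function name is an assumption for illustration): the document vector alone is used to predict each word in the document, with word order ignored, mirroring how skip-gram predicts context words from a single input.

```python
# Simplified PV-DBOW training pairs: the document tag alone
# predicts each word in the document; word order is ignored.
def pv_dbow_pairs(tagged_doc):
    doc_id, words = tagged_doc[0], tagged_doc[1:]
    return [(doc_id, w) for w in words]

doc = ["DOC_01", "TensorFlow", "is", "an", "open"]
pairs = pv_dbow_pairs(doc)
# [('DOC_01', 'TensorFlow'), ('DOC_01', 'is'),
#  ('DOC_01', 'an'), ('DOC_01', 'open')]
```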
