Learning Text Representations

Neural networks accept only numeric inputs. So, when we have textual data, we convert it into a numeric or vector representation and feed it to the network. There are various methods for converting input text to numeric form; some of the popular ones include term frequency-inverse document frequency (tf-idf) and bag of words (BOW). However, these methods do not capture the semantics of a word, which means they do not understand the meaning of the words they represent.
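
To make this limitation concrete, here is a minimal, illustrative bag-of-words sketch in plain Python (the sentences and helper names are made up for the example, not taken from the book):

```python
sentences = [
    "the movie was good",
    "the film was great",
]

# Build the vocabulary: one dimension per unique word
vocabulary = sorted({word for sentence in sentences for word in sentence.split()})

def bag_of_words(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    words = sentence.split()
    return [words.count(word) for word in vocabulary]

for sentence in sentences:
    print(sentence, "->", bag_of_words(sentence))

# 'movie'/'film' and 'good'/'great' occupy separate dimensions, so nothing in
# these count vectors tells us that the two sentences mean almost the same thing.
```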

In this chapter, we will learn about an algorithm called word2vec, which converts textual input into meaningful vectors. It learns a semantic vector representation for each word in the given input text. We will start off the chapter by understanding the word2vec model and its two variants, the continuous bag-of-words (CBOW) and skip-gram models. Next, we will learn how to build a word2vec model using the gensim library and how to visualize high-dimensional word embeddings in TensorBoard.
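
As a quick preview of what building such a model looks like, the following is a small sketch using gensim (assuming gensim 4.x; the toy corpus and parameter values are purely illustrative and not the book's):

```python
from gensim.models import Word2Vec

# A toy corpus: each document is a list of tokens (illustrative only)
corpus = [
    ["i", "love", "deep", "learning"],
    ["word2vec", "learns", "word", "embeddings"],
    ["deep", "learning", "uses", "neural", "networks"],
]

# sg=0 trains the CBOW variant; sg=1 would train skip-gram
model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the word vectors (gensim 4.x parameter name)
    window=2,         # context window size
    min_count=1,      # keep every word, since the corpus is tiny
    sg=0,
)

# Each word now has a dense vector, and similar words can be queried
print(model.wv["learning"].shape)
print(model.wv.most_similar("learning"))
```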

Going ahead, we will learn about the doc2vec model, which learns representations for entire documents. We will understand two different methods in doc2vec, called the Paragraph Vector - Distributed Memory (PV-DM) model and the Paragraph Vector - Distributed Bag of Words (PV-DBOW) model. We will also see how to perform document classification using doc2vec. At the end of the chapter, we will learn about the skip-thoughts and quick-thoughts algorithms, which are used for learning sentence representations.
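
To give a flavour of the doc2vec workflow covered later, here is a minimal sketch using gensim (again assuming gensim 4.x; the toy documents, tags, and parameters are made up for illustration):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# A toy set of documents, each wrapped in a TaggedDocument with a unique tag
documents = [
    TaggedDocument(words=["machine", "learning", "is", "fun"], tags=["doc0"]),
    TaggedDocument(words=["deep", "learning", "uses", "neural", "networks"], tags=["doc1"]),
    TaggedDocument(words=["i", "enjoy", "reading", "novels"], tags=["doc2"]),
]

# dm=1 trains the PV-DM variant; dm=0 would train PV-DBOW
model = Doc2Vec(documents, vector_size=30, min_count=1, epochs=40, dm=1)

# Infer a vector for an unseen document and find the most similar training documents
new_vector = model.infer_vector(["neural", "networks", "are", "fun"])
print(model.dv.most_similar([new_vector]))
```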

In this chapter, we will cover the following topics:

  • The word2vec model
  • Building a word2vec model using gensim
  • Visualizing word embeddings in TensorBoard
  • Doc2vec model
  • Finding similar documents using doc2vec
  • Skip-thoughts
  • Quick-thoughts