Implementing Natural Language Processing

In this chapter, we will discuss word vectors (Word2Vec) and paragraph vectors (Doc2Vec) in DL4J. We will develop a complete running example step by step, covering all of the stages: ETL, model configuration, training, and evaluation. Word2Vec and Doc2Vec are natural language processing (NLP) implementations in DL4J. Before we turn to Word2Vec, it is worth saying a little about the bag-of-words algorithm.

Bag-of-words is an algorithm that counts the occurrences of words in documents. These counts can then be used as features for document classification. Bag-of-words and Word2Vec are two different ways of representing text numerically; Word2Vec's continuous bag-of-words (CBOW) variant, for instance, predicts a word from an unordered bag of its surrounding words. In addition to these representations, term frequency-inverse document frequency (TF-IDF) can be used to judge the topic/context of a document. With TF-IDF, a score is calculated for every word, and the raw word counts are replaced with these scores. TF-IDF is a simple scoring scheme, but word embeddings may be a better choice, as they can capture semantic similarity between words. That said, if your dataset is small and the context is domain-specific, bag-of-words may be a better choice than Word2Vec.
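To make these two scoring schemes concrete, here is a minimal, framework-free Java sketch of both ideas. The class and method names are illustrative only and are not part of any DL4J API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of bag-of-words counting and TF-IDF scoring.
public class TfIdfSketch {

    // Bag-of-words: count how often each word occurs in a document.
    static Map<String, Integer> bagOfWords(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // TF-IDF: term frequency weighted by inverse document frequency.
    // tf  = occurrences of the term in this document / tokens in it
    // idf = log(total documents / documents containing the term)
    static double tfIdf(String term, Map<String, Integer> wordCounts,
                        int docLength, int totalDocs, int docsWithTerm) {
        double tf = wordCounts.getOrDefault(term, 0) / (double) docLength;
        double idf = Math.log(totalDocs / (double) docsWithTerm);
        return tf * idf;
    }
}
```

A word that is frequent in one document but rare across the corpus receives a high TF-IDF score, which is why the score is a useful proxy for a document's topic.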

Word2Vec is a shallow, two-layer neural network that processes text: its input is a text corpus and its output is a set of feature vectors, one for each word in the corpus's vocabulary.

Note that Word2Vec itself is not a deep neural network (DNN). Rather, it transforms text data into a numerical format that a DNN can understand, which is what makes the two useful in combination.

We can therefore combine Word2Vec with a DNN for downstream tasks. Unlike an autoencoder, Word2Vec doesn't train the input words through reconstruction; instead, it trains each word against the words that neighbor it in the corpus, as in the sketch that follows.
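As a preview of the recipes in this chapter, here is a minimal sketch of training Word2Vec with DL4J's builder API. The corpus file name and the hyperparameter values are placeholders, chosen only for illustration:

```java
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class Word2VecSketch {
    public static void main(String[] args) throws Exception {
        // One sentence per line; the file path is a placeholder.
        SentenceIterator iterator = new BasicLineIterator("raw_sentences.txt");

        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        tokenizer.setTokenPreProcessor(new CommonPreprocessor());

        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)   // ignore rare words
                .layerSize(100)        // dimensionality of the word vectors
                .windowSize(5)         // neighboring-word context window
                .seed(42)
                .iterate(iterator)
                .tokenizerFactory(tokenizer)
                .build();

        vec.fit();   // train against neighboring words in the corpus

        // Words that appear in similar contexts end up with similar vectors.
        System.out.println(vec.wordsNearest("day", 10));
    }
}
```

The windowSize parameter controls how many neighboring words count as context, which is precisely the "training against neighbors" behavior described above.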

Doc2Vec (paragraph vectors) associates documents with labels, and is an extension of Word2Vec. Word2Vec tries to correlate words with words, while Doc2Vec correlates words with labels. Once we represent documents as vectors, we can use those vectors as input to a supervised learning algorithm that maps them to labels.
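Doc2Vec is exposed in DL4J as the ParagraphVectors class. The sketch below assumes that documents are stored in subdirectories named after their labels; the folder name and hyperparameter values are placeholders:

```java
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.documentiterator.FileLabelAwareIterator;
import org.deeplearning4j.text.documentiterator.LabelAwareIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

import java.io.File;

public class Doc2VecSketch {
    public static void main(String[] args) throws Exception {
        // Each subdirectory of label_data is treated as one label,
        // and the files inside it as documents with that label.
        LabelAwareIterator iterator = new FileLabelAwareIterator.Builder()
                .addSourceFolder(new File("label_data"))
                .build();

        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        tokenizer.setTokenPreProcessor(new CommonPreprocessor());

        ParagraphVectors paragraphVectors = new ParagraphVectors.Builder()
                .iterate(iterator)
                .tokenizerFactory(tokenizer)
                .layerSize(100)          // dimensionality of document vectors
                .trainWordVectors(true)  // learn word vectors alongside labels
                .build();

        paragraphVectors.fit();  // correlate words with document labels
    }
}
```

After training, each label has a vector in the same space as the word vectors, which is what allows a downstream supervised model to map document vectors to labels.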

In this chapter, we will cover the following recipes:

  • Reading and loading text data
  • Tokenizing data and training the model
  • Evaluating the model
  • Generating plots from the model
  • Saving and reloading the model
  • Importing Google News vectors
  • Troubleshooting and tuning Word2Vec models
  • Using Word2Vec for sentence classification using CNNs
  • Using Doc2Vec for document classification