The classical approach

Traditionally, word representations were built using approaches such as the bag-of-words model. In this model, each word is treated as independent of every other word. Such representations therefore often rely on a one-hot encoding, which indicates the presence or absence of each word, to produce a vector representation of a sentence or document. However, this kind of representation is seldom useful in real-world applications, where the meaning of a word changes based on the words surrounding it. For example, let us consider the following sentences: The cat sat on the broken wall, and, The dog jumped over the brick structure. Although these two sentences describe two separate events, their semantic meanings are similar to one another. For instance, a dog is similar to a cat, as both are animals, while a wall can be viewed as similar to a brick structure. Hence, while the sentences describe different events, they are semantically related to one another. In the classical bag-of-words approach, where each word is encoded in its own dimension, such semantic similarity cannot be captured.

Let us consider the following sentences:

  • TensorFlow is an open source software library
  • Python is an open source interpreted software programming language

If we consider the earlier two lines of text to be separate documents, we can construct two lists of words:

  • [TensorFlow, is, an, open, source, software, library]
  • [Python, is, an, open, source, interpreted, software, programming, language]

The vocabulary of these two documents can be written as: [TensorFlow, is, an, open, source, software, library, Python, interpreted, programming, language], which is 11 words long.
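To make this concrete, the following is a minimal sketch in plain Python that builds the vocabulary from the two documents; the variable names doc1, doc2, and vocabulary are ours, introduced only for illustration:

  doc1 = "TensorFlow is an open source software library"
  doc2 = "Python is an open source interpreted software programming language"

  vocabulary = []
  for document in (doc1, doc2):
      for word in document.split():
          if word not in vocabulary:   # keep only the first occurrence of each word
              vocabulary.append(word)

  print(vocabulary)
  # ['TensorFlow', 'is', 'an', 'open', 'source', 'software', 'library',
  #  'Python', 'interpreted', 'programming', 'language']
  print(len(vocabulary))               # 11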

Hence, we can represent the two documents as follows:

  • [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
  • [0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]

In the preceding representation, every number indicates how many times the word at that position of the vocabulary occurs in the document. Hence, we can see that as the vocabulary grows, most of the words in the vocabulary will not be present in any given document, making the representation a long and mostly empty (zero-filled) vector. The size of the vector is the size of the vocabulary, however large that may be.
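Continuing the same sketch, each document can be turned into a count vector over the vocabulary built above; the helper name bag_of_words is illustrative, not a standard library function:

  def bag_of_words(document, vocabulary):
      tokens = document.split()
      # each position counts how often that vocabulary word occurs in the document
      return [tokens.count(word) for word in vocabulary]

  print(bag_of_words(doc1, vocabulary))   # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
  print(bag_of_words(doc2, vocabulary))   # [0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]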

Another important aspect that is lost with classical methods is the order in which words occur in a sentence. The traditional bag-of-words approach aggregates the vocabulary of the text in the documents to obtain a representation of the words that are present, but in doing so the context is lost; like the encoding discussed previously, it assumes that the words in a document are independent of one another. Another pitfall of this approach is that such a representation leads to data sparsity, which makes it difficult to train statistical models. This forms the fundamental motivation for using vector representations of words, where the semantics of the words are encoded in the representation itself.
