Our achievements and goals

Our current text preprocessing phase includes the following steps:

  1. Tokenizing the text
  2. Throwing away words that occur so often that they are of no help in detecting relevant posts
  3. Throwing away words that occur so infrequently that there is little chance they will occur in future posts
  4. Counting the remaining words
  5. Calculating TF-IDF values from the counts, considering the whole text corpus

Again, we can congratulate ourselves. With this process, we are able to convert a bunch of noisy text into a concise representation of feature values.
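
The whole phase can be written down in a few lines. The following is a minimal sketch, assuming scikit-learn's TfidfVectorizer is available; the sample posts and the frequency thresholds are illustrative choices, not part of our actual corpus:

```python
# A sketch of the preprocessing pipeline, assuming scikit-learn is installed.
# The sample posts and the max_df/min_df thresholds are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer

posts = [
    "imaging databases store images",
    "imaging databases can get huge",
    "most imaging databases save images permanently",
]

vectorizer = TfidfVectorizer(
    stop_words="english",  # drop very common English words
    max_df=0.9,            # drop words appearing in more than 90% of the posts
    min_df=1,              # drop words appearing in fewer than this many posts
)

# fit_transform tokenizes the posts, counts the surviving words, and
# converts the counts to TF-IDF values over the whole corpus
X = vectorizer.fit_transform(posts)

print(vectorizer.get_feature_names_out())  # the remaining feature words
print(X.toarray())                         # one TF-IDF vector per post
```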

But as simple and powerful as the bag of words approach with its extensions is, it has some drawbacks that we should be aware of:

  • It does not cover word relations: With the aforementioned vectorization approach, the texts "car hits wall" and "wall hits car" will both have the same feature vector
  • It does not capture negations correctly: For instance, the texts "I will eat ice cream" and "I will not eat ice cream" will look very similar in terms of their feature vectors, although they convey quite the opposite meaning. This problem, however, can easily be mitigated by not only counting individual words (unigrams), but also considering bigrams (pairs of words) or trigrams (three words in a row), as shown in the sketch after this list
  • It totally fails with misspelled words: Although it is clear to us that "database" and "databas" convey the same meaning, our approach will treat them as totally different words
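
To illustrate the negation point, here is a small sketch, again assuming scikit-learn; CountVectorizer's ngram_range parameter switches from counting unigrams only to counting unigrams plus bigrams (the two sentences are the ones from the list above):

```python
# A sketch of the bigram mitigation, assuming scikit-learn is installed.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I will eat ice cream", "I will not eat ice cream"]

# ngram_range=(1, 2) counts unigrams and bigrams, so "not eat" and "will not"
# become features of their own and the two sentences are easier to tell apart
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```

The price for this mitigation is a much larger feature space, since every observed word pair becomes a feature of its own.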

For brevity's sake, let's nevertheless stick with the current approach, which we can now use to efficiently build clusters.
