Bag of words feature extraction

Text feature extraction is the process of transforming what is essentially a list of words into a feature set that is usable by a classifier. The NLTK classifiers expect dict-style feature sets, so we must transform our text into a dict. The bag of words model is the simplest method; it constructs a word presence feature set from all the words of an instance. This method ignores the order of the words and how many times each word occurs; all that matters is whether the word is present in a list of words.

How to do it...

The idea is to convert a list of words into a dict, where each word becomes a key with the value True. The bag_of_words() function in the featx module looks like this:

def bag_of_words(words):
  return dict([(word, True) for word in words])

We can use it with a list of words; in this case, the tokenized sentence "the quick brown fox":

>>> from featx import bag_of_words
>>> bag_of_words(['the', 'quick', 'brown', 'fox'])
{'quick': True, 'brown': True, 'the': True, 'fox': True}

The resulting dict is known as a bag of words because the words are not in order, and it doesn't matter where in the list of words they occurred, or how many times they occurred. All that matters is that the word is found at least once.
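As a quick check of this property, the following sketch compares two word lists that differ only in order and repetition. It repeats the bag_of_words() definition locally so that it runs without the featx module; the variable names are illustrative only.

```python
def bag_of_words(words):
  # Same definition as in the recipe, repeated so the snippet is standalone.
  return dict([(word, True) for word in words])

ordered = bag_of_words(['the', 'quick', 'brown', 'fox'])
shuffled = bag_of_words(['fox', 'the', 'brown', 'the', 'quick', 'the'])

# Dict equality ignores key order, and repeated words collapse into one key.
same = ordered == shuffled
print(same)
```

Both calls produce a dict with the same four keys, so the comparison holds even though 'the' appears three times in the second list.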


You can use different values than True, but it is important to keep in mind that the NLTK classifiers learn from the unique combination of (key, value). That means that ('fox', 1) is treated as a different feature than ('fox', 2).
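To make the (key, value) point concrete, here is a minimal sketch that uses occurrence counts as values. The bag_of_word_counts() function is a hypothetical variant written for this illustration, not part of the recipe's code.

```python
from collections import Counter

def bag_of_word_counts(words):
  # Hypothetical variant: each value is the occurrence count instead of True.
  return dict(Counter(words))

once = bag_of_word_counts(['the', 'fox'])
twice = bag_of_word_counts(['the', 'fox', 'fox'])

# View each feature set as a set of (key, value) pairs and intersect them.
shared = set(once.items()) & set(twice.items())
print(shared)
```

Only ('the', 1) is shared. The two instances both contain fox, but ('fox', 1) and ('fox', 2) are different pairs, so a classifier that learns from (key, value) combinations sees no fox feature in common between them.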

How it works...

The bag_of_words() function is a very simple list comprehension that constructs a dict from the given words, where every word gets the value True.

Since we have to assign a value to each word in order to create a dict, True is a logical choice to indicate word presence. If we knew the universe of all possible words, we could assign the value False to every word that is not in the given list of words. But most of the time, we don't know all the possible words beforehand. Plus, the resulting dict would be very large (assuming all words in the English language are possible). So, to keep feature extraction simple and use less memory, we assign the value True only to the words that occur at least once. We never assign the value False, because we don't know what the set of possible words is; we only know about the words we are given.
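For contrast, the following sketch shows what explicit False values would look like if the vocabulary were known in advance. Both the bag_of_words_with_absent() helper and the six-word vocabulary are assumptions made up for this illustration.

```python
def bag_of_words_with_absent(words, vocabulary):
  # Hypothetical helper: mark every vocabulary word as present or absent.
  present = set(words)
  return {word: (word in present) for word in vocabulary}

# An assumed, deliberately tiny vocabulary; a real one could be enormous.
vocabulary = ['the', 'quick', 'brown', 'fox', 'lazy', 'dog']
features = bag_of_words_with_absent(['the', 'quick', 'brown', 'fox'], vocabulary)

absent = [word for word, is_present in features.items() if not is_present]
print(len(features), absent)
```

The feature set always has one entry per vocabulary word, regardless of how short the instance is, which is why this approach scales poorly. Note also that this sketch silently drops any word that is not in the vocabulary.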

There's more...

In the default bag of words model, all words are treated equally. But that's not always a good idea. As we already know, some words are so common that they are practically meaningless. If you have a set of words that you want to exclude, you can use the bag_of_words_not_in_set() function in the featx module:

def bag_of_words_not_in_set(words, badwords):
  return bag_of_words(set(words) - set(badwords))

This function can be used, among other things, to filter stopwords. Here's an example where we filter the word 'the' from "the quick brown fox":

>>> from featx import bag_of_words_not_in_set
>>> bag_of_words_not_in_set(['the', 'quick', 'brown', 'fox'], ['the'])
{'quick': True, 'brown': True, 'fox': True}

As expected, the resulting dict has 'quick', 'brown', and 'fox', but not 'the'.

Filtering stopwords

Stopwords are words that are often useless in NLP because they don't convey much meaning, such as the word 'the'. Here's an example of using the bag_of_words_not_in_set() function to filter all English stopwords:

from nltk.corpus import stopwords

def bag_of_non_stopwords(words, stopfile='english'):
  badwords = stopwords.words(stopfile)
  return bag_of_words_not_in_set(words, badwords)

You can pass a different language filename as the stopfile keyword argument if you are using a language other than English. Using this function produces the same result as the previous example:

>>> from featx import bag_of_non_stopwords
>>> bag_of_non_stopwords(['the', 'quick', 'brown', 'fox'])
{'quick': True, 'brown': True, 'fox': True}

Here, 'the' is a stopword, so it is not present in the returned dict.

Including significant bigrams

In addition to single words, it often helps to include significant bigrams. As significant bigrams are less common than most individual words, including them in the bag of words model can help the classifier make better decisions. We can use the BigramCollocationFinder class covered in the Discovering word collocations recipe of Chapter 1, Tokenizing Text and WordNet Basics, to find significant bigrams. The bag_of_bigrams_words() function in the featx module returns a dict of all words along with up to 200 of the most significant bigrams:

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bag_of_bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
  bigram_finder = BigramCollocationFinder.from_words(words)
  bigrams = bigram_finder.nbest(score_fn, n)
  return bag_of_words(words + bigrams)

The bigrams will be present in the returned dict as (word1, word2) tuples with the value True. Using the same example words as we did earlier, we get all words plus every bigram:

>>> from featx import bag_of_bigrams_words
>>> bag_of_bigrams_words(['the', 'quick', 'brown', 'fox'])
{'brown': True, ('brown', 'fox'): True, ('the', 'quick'): True, 'fox': True, ('quick', 'brown'): True, 'quick': True, 'the': True}

You can change the maximum number of bigrams found by altering the keyword argument n.
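The four-word example returns every bigram because a list of four words has only three adjacent pairs, far fewer than the default n of 200. The sketch below shows the shape of the resulting feature set without NLTK: it builds the adjacent pairs with zip() and applies no significance scoring at all, so it is not a substitute for BigramCollocationFinder. The bag_of_all_bigrams_words() name is hypothetical.

```python
def bag_of_words(words):
  # Same definition as in the recipe, repeated so the snippet is standalone.
  return dict([(word, True) for word in words])

def bag_of_all_bigrams_words(words):
  # Hypothetical stand-in: keep every adjacent pair, with no scoring or cutoff.
  words = list(words)
  bigrams = list(zip(words, words[1:]))
  return bag_of_words(words + bigrams)

features = bag_of_all_bigrams_words(['the', 'quick', 'brown', 'fox'])
print(len(features))
```

The result has seven keys: the four words as strings and the three bigrams as tuples, all with the value True. With longer texts, the scoring function and n decide which of the many candidate bigrams are kept.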

See also

The Discovering word collocations recipe of Chapter 1, Tokenizing Text and WordNet Basics, covers the BigramCollocationFinder class in more detail. In the next recipe, we will train a NaiveBayesClassifier class using feature sets created with the bag of words model.

