Text feature extraction is the process of transforming what is essentially a list of words into a feature set that is usable by a classifier. The NLTK classifiers expect `dict` style feature sets, so we must transform our text into a `dict`. The bag of words model is the simplest method; it constructs a word presence feature set from all the words of an instance. This method doesn't care about the order of the words or how many times a word occurs; all that matters is whether the word is present in a list of words.
The idea is to convert a list of words into a `dict`, where each word becomes a key with the value `True`. The `bag_of_words()` function in `featx.py` looks like this:

```python
def bag_of_words(words):
    return dict([(word, True) for word in words])
```
We can use it with a list of words; in this case, the tokenized sentence "the quick brown fox":

```python
>>> from featx import bag_of_words
>>> bag_of_words(['the', 'quick', 'brown', 'fox'])
{'quick': True, 'brown': True, 'the': True, 'fox': True}
```
The resulting `dict` is known as a bag of words because the words are not in order; it doesn't matter where in the list of words they occurred, or how many times they occurred. All that matters is that each word is found at least once. The `bag_of_words()` function is a very simple list comprehension that constructs a `dict` from the given words, where every word gets the value `True`.
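As a side note, the same presence mapping can be built without a comprehension. This sketch (the function name is invented for illustration; it is not part of `featx.py`) uses `dict.fromkeys`, which maps every key to the same value and produces an identical result:

```python
def bag_of_words_fromkeys(words):
    # dict.fromkeys maps every word to the same value, True,
    # deduplicating repeated words just like the dict() version
    return dict.fromkeys(words, True)

print(bag_of_words_fromkeys(['the', 'quick', 'brown', 'fox']))
```

Both versions collapse duplicate words into a single key, which is exactly the presence-only behavior the bag of words model wants.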
Since we have to assign a value to each word in order to create a `dict`, `True` is a logical choice to indicate word presence. If we knew the universe of all possible words, we could assign the value `False` to all the words that are not in the given list of words. But most of the time, we don't know all the possible words beforehand. Plus, the `dict` that would result from assigning `False` to every possible word would be very large (assuming all words in the English language are possible). So instead, to keep feature extraction simple and use less memory, we stick to assigning the value `True` to all words that occur at least once. We don't assign the value `False` to any word since we don't know what the set of possible words is; we only know about the words we are given.
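For contrast, here is a sketch of what a closed-vocabulary variant might look like if you did know the universe of possible words. The function name and the tiny vocabulary are invented for illustration; with a realistic vocabulary, these feature dicts would be far larger than the presence-only version, which is exactly why the book avoids this approach:

```python
def bag_of_words_in_vocab(words, vocab):
    # Every vocabulary word appears as a key; words absent from
    # the instance get the value False instead of being omitted.
    present = set(words)
    return {word: (word in present) for word in vocab}

vocab = ['the', 'quick', 'brown', 'fox', 'lazy', 'dog']
print(bag_of_words_in_vocab(['the', 'quick', 'brown', 'fox'], vocab))
# 'lazy' and 'dog' map to False because they do not occur in the instance
```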
In the default bag of words model, all words are treated equally. But that's not always a good idea. As we already know, some words are so common that they are practically meaningless. If you have a set of words that you want to exclude, you can use the `bag_of_words_not_in_set()` function in `featx.py`:

```python
def bag_of_words_not_in_set(words, badwords):
    return bag_of_words(set(words) - set(badwords))
```
This function can be used, among other things, to filter stopwords. Here's an example where we filter the word "the" from "the quick brown fox":

```python
>>> from featx import bag_of_words_not_in_set
>>> bag_of_words_not_in_set(['the', 'quick', 'brown', 'fox'], ['the'])
{'quick': True, 'brown': True, 'fox': True}
```
As expected, the resulting `dict` has `quick`, `brown`, and `fox`, but not `the`.
Stopwords are words that are often useless in NLP, in that they don't convey much meaning, such as the word "the". Here's an example of using the `bag_of_words_not_in_set()` function to filter all English stopwords:

```python
from nltk.corpus import stopwords

def bag_of_non_stopwords(words, stopfile='english'):
    badwords = stopwords.words(stopfile)
    return bag_of_words_not_in_set(words, badwords)
```
You can pass a different language filename as the `stopfile` keyword argument if you are using a language other than English. Using this function produces the same result as the previous example:

```python
>>> from featx import bag_of_non_stopwords
>>> bag_of_non_stopwords(['the', 'quick', 'brown', 'fox'])
{'quick': True, 'brown': True, 'fox': True}
```
Here, `the` is a stopword, so it is not present in the returned `dict`.
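If the NLTK stopwords corpus hasn't been downloaded, the same filtering pipeline can be approximated with a hand-rolled stopword set. This is only a sketch: the function name and the deliberately tiny stopword list are invented here, and the real NLTK English list is much longer:

```python
def bag_of_words(words):
    return dict([(word, True) for word in words])

def bag_of_words_not_in_set(words, badwords):
    return bag_of_words(set(words) - set(badwords))

# A tiny, illustrative stopword set; not a substitute for the full corpus.
TINY_STOPWORDS = {'the', 'a', 'an', 'and', 'of', 'in'}

def bag_of_non_tiny_stopwords(words):
    return bag_of_words_not_in_set(words, TINY_STOPWORDS)

print(bag_of_non_tiny_stopwords(['the', 'quick', 'brown', 'fox']))
# {'quick': True, 'brown': True, 'fox': True} (key order may vary)
```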
In addition to single words, it often helps to include significant bigrams. As significant bigrams are less common than most individual words, including them in the bag of words model can help the classifier make better decisions. We can use the `BigramCollocationFinder` class covered in the Discovering word collocations recipe of Chapter 1, Tokenizing Text and WordNet Basics, to find significant bigrams. The `bag_of_bigrams_words()` function found in `featx.py` will return a `dict` of all words along with the 200 most significant bigrams:

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bag_of_bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bag_of_words(words + bigrams)
```
The bigrams will be present in the returned `dict` as `(word1, word2)` tuples with the value `True`. Using the same example words as we did earlier, we get all words plus every bigram:

```python
>>> from featx import bag_of_bigrams_words
>>> bag_of_bigrams_words(['the', 'quick', 'brown', 'fox'])
{'brown': True, ('brown', 'fox'): True, ('the', 'quick'): True, 'fox': True, ('quick', 'brown'): True, 'quick': True, 'the': True}
```
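Under the hood, extracting bigrams is just pairing each word with its successor; the `BigramCollocationFinder` adds scoring on top so that only the most significant pairs survive. Here is a minimal sketch (the function name is invented) that skips the scoring step and includes every adjacent pair, which is why it matches the output above for such a short sentence:

```python
def bag_of_all_bigrams_words(words):
    # zip pairs each word with the next one, yielding (word1, word2) tuples
    bigrams = list(zip(words, words[1:]))
    # words is a list of strings, bigrams a list of tuples; both become keys
    return dict([(term, True) for term in words + bigrams])

print(bag_of_all_bigrams_words(['the', 'quick', 'brown', 'fox']))
```

On longer texts this naive version would include many noisy pairs, which is exactly what the chi-squared scoring in `bag_of_bigrams_words()` is there to filter out.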
You can change the maximum number of bigrams found by altering the keyword argument `n`. The Discovering word collocations recipe of Chapter 1, Tokenizing Text and WordNet Basics, covers the `BigramCollocationFinder` class in more detail. In the next recipe, we will train a `NaiveBayesClassifier` class using feature sets created with the bag of words model.