A special type of data – text

Let's introduce another type of data. Text is a frequently used input for machine learning algorithms since it is a natural representation of data in our own language. It's so rich that it often contains the very answer we're looking for. The most common way of dealing with text is the bag-of-words approach. According to this approach, every word becomes a feature and each text becomes a vector that contains non-zero elements for all the features (that is, the words) in its body. Given a text dataset, what's the number of features? It's simple: just extract all the unique words in it and enumerate them. For a very rich text that uses all the English words, that number is in the 1 million range. If you don't process it any further (removing third-person forms, abbreviations, contractions, and acronyms), you might find yourself dealing with more than that, but that's a very rare case. In the plain and simple approach that is the target of this book, we just let Python do its best.

The dataset used in this section is textual; it's the famous 20 Newsgroups dataset (for more information, visit http://qwone.com/~jason/20Newsgroups/). It is a collection of about 20,000 documents that belong to 20 newsgroup topics. It's one of the most frequently used datasets (if not the most used) for demonstrating text classification and clustering. To import it, we're going to use a restricted subset that contains two of the science topics (medicine and space):

In: from sklearn.datasets import fetch_20newsgroups
categories = ['sci.med', 'sci.space']
twenty_sci_news = fetch_20newsgroups(categories=categories)

The first time you run this command, it automatically downloads the dataset and places it in the $HOME/scikit_learn_data/20news_home/ default directory. You can query the dataset object by asking for the location of the files, their content, and the label (that is, the topic of the discussion where the document was posted). They're located in the .filenames, .data, and .target attributes of the object, respectively:

In: print(twenty_sci_news.data[0])

Out: From: [email protected] ("F.Baube[tm]")
Subject: Vandalizing the sky
X-Added: Forwarded by Space Digest
Organization: [via International Space University]
Original-Sender: [email protected]
Distribution: sci
Lines: 12
From: "Phil G. Fraering" <[email protected]>

In: twenty_sci_news.filenames

Out: array([

In: print(twenty_sci_news.target[0])
print(twenty_sci_news.target_names[twenty_sci_news.target[0]])

Out: 1
sci.space

The target is categorical, but it's represented as an integer (0 for sci.med and 1 for sci.space). If you want to read it as a name, use the integer as an index into the twenty_sci_news.target_names list.

The easiest way to deal with the text is by transforming the body of the dataset into a series of words. This means that, for each document, the number of times a specific word appears in the body will be counted.

For example, let's make a small, easy-to-process dataset:

  • Document_1: We love data science
  • Document_2: Data science is hard

In the entire dataset, which contains Document_1 and Document_2, there are only six different words: we, love, data, science, is, and hard. Given this ordered vocabulary, we can associate each document with a feature vector:

In: Feature_Document_1 = [1, 1, 1, 1, 0, 0]
Feature_Document_2 = [0, 0, 1, 1, 1, 1]

Note that we're discarding the positions of the words and retaining only the number of times the word appears in the document. That's all.
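The toy example above can be reproduced by hand. The following is a minimal sketch in plain Python, not the scikit-learn implementation; the helper names tokenize and to_feature_vector are hypothetical, and splitting on whitespace is a simplifying assumption about tokenization:

```python
# Toy bag-of-words: build the vocabulary, then count words per document.
documents = ["We love data science", "Data science is hard"]

def tokenize(text):
    # Assumption: lowercase and split on whitespace (real tokenizers do more)
    return text.lower().split()

# Vocabulary in order of first appearance
vocabulary = []
for doc in documents:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

def to_feature_vector(text):
    # One count per vocabulary word; word positions are discarded
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]

feature_vectors = [to_feature_vector(doc) for doc in documents]
print(vocabulary)
print(feature_vectors)
```

The two printed vectors should match Feature_Document_1 and Feature_Document_2, with the columns ordered as we, love, data, science, is, and hard.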

In the 20 Newsgroups dataset, this can be done with Python in a simple way:

In: from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
word_count = count_vect.fit_transform(twenty_sci_news.data)
print(word_count.shape)

Out: (1187, 25638)

First, we instantiate a CountVectorizer object. Then, we call fit_transform, the method that counts the terms in each document and produces a feature vector for each of them. We then query the matrix size. Note that the output matrix is sparse: each document uses only a small selection of the words, so the number of non-zero elements in each row is very low and it makes no sense to store all the redundant zeros. The output shape is (1187, 25638). The first value is the number of observations in the dataset (the number of documents), while the second is the number of features (the number of unique words in the dataset).
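To make the idea of sparse storage concrete, here is a small sketch that keeps only the non-zero cells of a count matrix. The dictionary keyed by (row, column) is a layout chosen for illustration, and the to_sparse helper is hypothetical; it is not the storage format that scikit-learn's sparse output uses internally:

```python
# Dense count matrix for the two toy documents (2 documents x 6 words)
dense = [
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1],
]

def to_sparse(matrix):
    # Keep only the non-zero cells, keyed by (row, column)
    return {(i, j): value
            for i, row in enumerate(matrix)
            for j, value in enumerate(row)
            if value != 0}

sparse = to_sparse(dense)
n_cells = len(dense) * len(dense[0])
density = len(sparse) / n_cells

for (i, j), value in sparse.items():
    print((i, j), value)
print("stored %i of %i cells (density %.2f)" % (len(sparse), n_cells, density))
```

On a real corpus the saving is far larger than in this toy case, because each document touches only a tiny fraction of the tens of thousands of columns.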

After the CountVectorizer transformation, each document is associated with its feature vector. Let's take a look at the first document:

In: print (word_count[0])

Out: (0, 10827) 2
(0, 10501) 2
(0, 17170) 1
(0, 10341) 1
(0, 4762) 2
(0, 23381) 2
(0, 22345) 1
(0, 24461) 1
(0, 23137) 7

You will notice that the output is a sparse vector where only non-zero elements are stored. To check the direct correspondence to words, just try the following code:

In: # get_feature_names() in scikit-learn < 1.0
word_list = count_vect.get_feature_names_out()
for n in word_count[0].indices:
    print('Word "%s" appears %i times' % (word_list[n],
                                           word_count[0, n]))

Out: Word "from" appears 2 times
Word "flb" appears 2 times
Word "optiplan" appears 1 times
Word "fi" appears 1 times
Word "baube" appears 2 times
Word "tm" appears 2 times
Word "subject" appears 1 times
Word "vandalizing" appears 1 times
Word "the" appears 7 times

So far, everything has been pretty simple, hasn't it? Let's move on to a task of greater complexity and effectiveness. Counting words is good, but we can do more: we can compute their frequency. Frequency is a measure that you can compare across differently sized datasets. It gives an idea of whether a word is a stop word (that is, a very common word such as a, an, the, or is) or a rare, unique one. Typically, rare words are the most important because they're able to characterize an instance, and features based on them are very discriminative in the learning process. To retrieve the frequency of each word in each document, try the following code:

In: from sklearn.feature_extraction.text import TfidfVectorizer
tf_vect = TfidfVectorizer(use_idf=False, norm='l1')
word_freq = tf_vect.fit_transform(twenty_sci_news.data)
# get_feature_names() in scikit-learn < 1.0
word_list = tf_vect.get_feature_names_out()
for n in word_freq[0].indices:
    print('Word "%s" has frequency %0.3f' % (word_list[n],
                                              word_freq[0, n]))

Out: Word "from" has frequency 0.022
Word "flb" has frequency 0.022
Word "optiplan" has frequency 0.011
Word "fi" has frequency 0.011
Word "baube" has frequency 0.022
Word "tm" has frequency 0.022
Word "subject" has frequency 0.011
Word "vandalizing" has frequency 0.011
Word "the" has frequency 0.077

The sum of the frequencies is 1 (or close to 1 due to floating-point approximation). This happens because we chose the l1 norm. In this specific case, the word frequency is a probability distribution function. Sometimes, it's nice to increase the difference between rare and common words. In such cases, you can use the l2 norm to normalize the feature vector.
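The difference between the two norms can be checked on a small vector. This is a sketch only: the counts below are hypothetical and are not taken from the dataset, and l1_normalize and l2_normalize are illustrative helper names:

```python
import math

# Hypothetical word counts for one short document: one common word, two rarer ones
counts = [7, 2, 1]

def l1_normalize(vector):
    # Divide by the sum of absolute values, so the result sums to 1
    total = sum(abs(x) for x in vector)
    return [x / total for x in vector]

def l2_normalize(vector):
    # Divide by the Euclidean length, so the squared values sum to 1
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]

l1 = l1_normalize(counts)
l2 = l2_normalize(counts)

print("l1:", [round(x, 3) for x in l1], "sum =", round(sum(l1), 3))
print("l2:", [round(x, 3) for x in l2],
      "sum of squares =", round(sum(x * x for x in l2), 3))
```

With the l1 norm the values can be read as a probability distribution. With the l2 norm they no longer sum to 1, and the gap between the largest and the smallest value is wider.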

An even more effective way to vectorize text data is tf-idf. In brief, you multiply the term frequency of each word in a document by the inverse document frequency of that word (that is, the inverse of the number of documents it appears in, or a logarithmically scaled version of that value). This is very handy for highlighting words that effectively describe each document and that discriminate strongly among the documents in the dataset:

In: from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer() # Default: use_idf=True
word_tfidf = tfidf_vect.fit_transform(twenty_sci_news.data)
# get_feature_names() in scikit-learn < 1.0
word_list = tfidf_vect.get_feature_names_out()
for n in word_tfidf[0].indices:
    print('Word "%s" has tf-idf %0.3f' % (word_list[n],
                                           word_tfidf[0, n]))

Out: Word "fred" has tf-idf 0.089
Word "twilight" has tf-idf 0.139
Word "evening" has tf-idf 0.113
Word "in" has tf-idf 0.024
Word "presence" has tf-idf 0.119
Word "its" has tf-idf 0.061
Word "blare" has tf-idf 0.150
Word "freely" has tf-idf 0.119
Word "may" has tf-idf 0.054
Word "god" has tf-idf 0.119
Word "blessed" has tf-idf 0.150
Word "is" has tf-idf 0.026
Word "profiting" has tf-idf 0.150

In this example, the four most characteristic words of the first document are caste, baube, flb, and tm (they have the highest tf-idf scores). This means that their term frequency within the document is high, whereas they're pretty rare in the remaining documents.
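To see where such scores come from, the computation can be sketched by hand on the two toy sentences used earlier. This sketch assumes a smoothed, logarithmically scaled inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by l2 normalization; that choice is an assumption made for illustration, and the helper names idf and tfidf are hypothetical:

```python
import math
from collections import Counter

documents = ["we love data science", "data science is hard"]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def idf(word):
    # Assumed smoothed, log-scaled inverse document frequency
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    # Raw term count times idf, then l2 normalization of the document vector
    counts = Counter(tokens)
    weights = {word: count * idf(word) for word, count in counts.items()}
    length = math.sqrt(sum(x * x for x in weights.values()))
    return {word: x / length for word, x in weights.items()}

weights_doc1 = tfidf(tokenized[0])
for word, weight in weights_doc1.items():
    print('Word "%s" has tf-idf %0.3f' % (word, weight))
```

The words we and love, which occur in only one of the two documents, end up with a higher weight than data and science, which occur in both.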

So far, for each word, we have generated a feature. What about taking a couple of words together? That's exactly what happens when you consider bigrams instead of unigrams. With bigrams (or, more generally, n-grams), what matters is not only the presence or absence of a word but also its neighbors (that is, the words near it and their order). Of course, you can mix unigrams and n-grams and create a rich feature vector for each document. In the following simple example, let's test how n-grams work:

In: text_1 = 'we love data science'
text_2 = 'data science is hard'
documents = [text_1, text_2]
print(documents)

Out: ['we love data science', 'data science is hard']

In: # This is what we saw above: the default unigram vectorizer
count_vect_1_grams = CountVectorizer(ngram_range=(1, 1),
                                     stop_words=[], min_df=1)
word_count = count_vect_1_grams.fit_transform(documents)
# get_feature_names() in scikit-learn < 1.0
word_list = count_vect_1_grams.get_feature_names_out().tolist()
print("Word list = ", word_list)
print("text_1 is described with", [word_list[n] + "(" +
      str(word_count[0, n]) + ")" for n in word_count[0].indices])

Out: Word list = ['data', 'hard', 'is', 'love', 'science', 'we']
text_1 is described with ['we(1)', 'love(1)', 'data(1)', 'science(1)']

In: # Now a bi-gram count vectorizer
count_vect_2_grams = CountVectorizer(ngram_range=(2, 2))
word_count = count_vect_2_grams.fit_transform(documents)
# get_feature_names() in scikit-learn < 1.0
word_list = count_vect_2_grams.get_feature_names_out().tolist()
print("Word list = ", word_list)
print("text_1 is described with", [word_list[n] + "(" +
      str(word_count[0, n]) + ")" for n in word_count[0].indices])

Out: Word list = ['data science', 'is hard', 'love data',
'science is', 'we love']
text_1 is described with ['we love(1)', 'love data(1)',
'data science(1)']

In: # Now a uni- and bi-gram count vectorizer
count_vect_1_2_grams = CountVectorizer(ngram_range=(1, 2))
word_count = count_vect_1_2_grams.fit_transform(documents)
# get_feature_names() in scikit-learn < 1.0
word_list = count_vect_1_2_grams.get_feature_names_out().tolist()
print("Word list = ", word_list)
print("text_1 is described with", [word_list[n] + "(" +
      str(word_count[0, n]) + ")" for n in word_count[0].indices])

Out: Word list = ['data', 'data science', 'hard', 'is', 'is hard', 'love',
'love data', 'science', 'science is', 'we', 'we love']
text_1 is described with ['we(1)', 'love(1)', 'data(1)', 'science(1)',
'we love(1)', 'love data(1)', 'data science(1)']

The preceding example intuitively combines the unigram and bigram approaches presented earlier. Here we used a CountVectorizer, but n-grams are just as commonly used with a TfidfVectorizer. Note that the number of features can explode when you use n-grams, since the number of possible word combinations grows exponentially with n.
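The way n-grams are extracted from a sentence can be sketched in a few lines of plain Python. The word_ngrams function is a hypothetical helper written for illustration, and whitespace tokenization is a simplifying assumption:

```python
def word_ngrams(text, min_n, max_n):
    # Assumption: lowercase and split on whitespace
    tokens = text.lower().split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n consecutive words over the token list
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

print(word_ngrams("we love data science", 1, 1))  # unigrams only
print(word_ngrams("we love data science", 2, 2))  # bigrams only
print(word_ngrams("we love data science", 1, 2))  # unigrams and bigrams
```

A sentence of four words yields four unigrams and three bigrams, so the mixed setting already produces seven features for a single short text.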

If you have too many features (the dictionary may be too rich, there may be too many n-grams, or the computer may simply be limited), you can use a trick that lowers the complexity of the problem (but you should first evaluate the trade-off between performance and complexity). It's common to use the hashing trick, where many words (or n-grams) are hashed and their hashes are allowed to collide, which creates buckets of words. Buckets are sets of semantically unrelated words that happen to have colliding hashes. With HashingVectorizer(), as shown in the following example, you can decide on the number of buckets of words you want. The resulting matrix, of course, reflects your setting:

In: from sklearn.feature_extraction.text import HashingVectorizer
hash_vect = HashingVectorizer(n_features=1000)
word_hashed = hash_vect.fit_transform(twenty_sci_news.data)
print(word_hashed.shape)

Out: (1187, 1000)

Note that you can't invert the hashing process (since it's a digest operation). Therefore, after this transformation, you will have to work on the hashed features as they are. Hashing presents quite a few advantages: it allows a quick transformation of a bag of words into vectors of features (hash buckets are our features, in this case), it easily accommodates never-previously-seen words among the features, and it helps to avoid overfitting by having unrelated words collide in the same feature.
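The mechanics of the hashing trick can be illustrated with a small sketch. It uses MD5 from the standard library purely because it is deterministic and readily available; this is an assumption made for illustration, and scikit-learn's own implementation differs (for instance, in the hash function it uses). The helpers bucket_index and hashing_vectorize are hypothetical names:

```python
import hashlib

def bucket_index(token, n_buckets):
    # Map a token to a bucket: hash it, then take the remainder
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hashing_vectorize(text, n_buckets=8):
    # No vocabulary is stored: the vector length is fixed in advance
    vector = [0] * n_buckets
    for token in text.lower().split():
        vector[bucket_index(token, n_buckets)] += 1
    return vector

vec = hashing_vectorize("data science is hard")
# A word never seen before still lands in one of the existing buckets
unseen = hashing_vectorize("astrophysics is hard")

print(vec)
print(unseen)
```

Because no dictionary is kept, there is no way to recover which words produced a given bucket count, which is why the transformation can't be inverted.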

