Extracting features from text

Many machine learning problems use text as an explanatory variable. Text must be transformed to a different representation that encodes as much of its meaning as possible in a feature vector. In the following sections we will review variations of the most common representation of text that is used in machine learning: the bag-of-words model.

The bag-of-words representation

The most common representation of text is the bag-of-words model. This representation uses a multiset, or bag, that encodes the words that appear in a text; the bag-of-words does not encode any of the text's syntax, ignores the order of words, and disregards all grammar. Bag-of-words can be thought of as an extension to one-hot encoding. It creates one feature for each word of interest in the text. The bag-of-words model is motivated by the intuition that documents containing similar words often have similar meanings. The bag-of-words model can be used effectively for document classification and retrieval despite the limited information that it encodes.

A collection of documents is called a corpus. Let's use a corpus with the following two documents to examine the bag-of-words model:

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]

This corpus contains eight unique words: UNC, played, Duke, in, basketball, lost, the, and game. The corpus's unique words comprise its vocabulary. The bag-of-words model uses a feature vector with an element for each of the words in the corpus's vocabulary to represent each document. Our corpus has eight unique words, so each document will be represented by a vector with eight elements. The number of elements that comprise a feature vector is called the vector's dimension. A dictionary maps the vocabulary to indices in the feature vector.
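
To make the mapping concrete, the following is a minimal sketch of how such a dictionary could be built by hand; it assumes simple lowercasing and whitespace tokenization, which is close to, but not exactly, what CountVectorizer does below:

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]

# Split each document on whitespace, lowercase the tokens, and collect the unique words
vocabulary = sorted(set(token.lower() for document in corpus for token in document.split()))
# Map each word in the vocabulary to an index in the feature vector
dictionary = dict((word, index) for index, word in enumerate(vocabulary))
print(dictionary)
# basketball -> 0, duke -> 1, game -> 2, in -> 3, lost -> 4,
# played -> 5, the -> 6, unc -> 7 (print order may vary)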

In the most basic bag-of-words representation, each element in the feature vector is a binary value that represents whether or not the corresponding word appeared in the document. For example, suppose that the dictionary maps UNC to the first element and game to the last. The first document contains UNC, so the first element of its vector equals one; it does not contain game, so the last element of its vector is set to zero. The CountVectorizer class can produce a bag-of-words representation from a string or file. By default, CountVectorizer converts the characters in the documents to lowercase and tokenizes the documents. Tokenization is the process of splitting a string into tokens, or meaningful sequences of characters. Tokens are frequently words, but they may also be shorter sequences, including punctuation characters and affixes. CountVectorizer tokenizes using a regular expression that extracts sequences of two or more alphanumeric characters, treating whitespace and punctuation as token boundaries. Note that CountVectorizer assigns indices in alphabetical order of the vocabulary, so in the output that follows unc is actually mapped to the last element and game to the third.
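
As a rough illustration of this rule, the following sketch applies a regular expression similar to CountVectorizer's default token pattern to an example sentence of our own; single-character tokens such as a are discarded:

import re

# A pattern similar to CountVectorizer's default token_pattern:
# sequences of two or more word characters
token_pattern = re.compile(r'(?u)\b\w\w+\b')

print(token_pattern.findall('A dog ate my sandwich'.lower()))
# ['dog', 'ate', 'my', 'sandwich'] -- the single-character token 'a' is dropped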

The documents in our corpus are represented by the following feature vectors:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
>>>     'UNC played Duke in basketball',
>>>     'Duke lost the basketball game'
>>> ]
>>> vectorizer = CountVectorizer()
>>> print vectorizer.fit_transform(corpus).todense()
>>> print vectorizer.vocabulary_
[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
{u'duke': 1, u'basketball': 0, u'lost': 4, u'played': 5, u'game': 2, u'unc': 7, u'in': 3, u'the': 6}

Now let's add a third document to our corpus:

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]

Our corpus's dictionary now contains the following ten unique words. Note that I and a were not extracted as they do not match the default regular expression that CountVectorizer uses to tokenize strings:

{u'duke': 2, u'basketball': 1, u'lost': 5, u'played': 6, u'in': 4, u'game': 3, u'sandwich': 7, u'unc': 9, u'ate': 0, u'the': 8} 

Now, our feature vectors are as follows:

UNC played Duke in basketball = [[0 1 1 0 1 0 1 0 0 1]]
Duke lost the basketball game = [[0 1 1 1 0 1 0 0 1 0]]
I ate a sandwich = [[1 0 0 0 0 0 0 1 0 0]]

The meanings of the first two documents are more similar to each other than they are to the third document, and their corresponding feature vectors are more similar to each other than they are to the third document's feature vector when using a metric such as Euclidean distance. The Euclidean distance between two vectors is equal to the Euclidean norm, or L2 norm, of the difference between the two vectors:

$$d = \left\| \mathbf{x}_0 - \mathbf{x}_1 \right\|$$

Recall that the Euclidean norm of a vector is equal to the vector's magnitude, which is given by the following equation:

$$\left\| \mathbf{x} \right\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$
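
As a quick check of these definitions, the following sketch uses NumPy to compute the distance between the first two documents' vectors directly from the norm of their difference; the vectors are the ones printed above:

import numpy as np

# Feature vectors for the first two documents (from the output above)
x0 = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1])
x1 = np.array([0, 1, 1, 1, 0, 1, 0, 0, 1, 0])

# The Euclidean distance is the L2 norm of the difference between the vectors
print(np.linalg.norm(x0 - x1))
# Approximately 2.449, that is, sqrt(6): the vectors differ in six elements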

scikit-learn's euclidean_distances function can be used to calculate the distance between two or more vectors, and it confirms that the most semantically similar documents are also the closest to each other in space. In the following example, we will use the euclidean_distances function to compare the feature vectors for our documents:

>>> from sklearn.metrics.pairwise import euclidean_distances
>>> counts = [
>>>     [0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
>>>     [0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
>>>     [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
>>> ]
>>> print 'Distance between 1st and 2nd documents:', euclidean_distances(counts[0], counts[1])
>>> print 'Distance between 1st and 3rd documents:', euclidean_distances(counts[0], counts[2])
>>> print 'Distance between 2nd and 3rd documents:', euclidean_distances(counts[1], counts[2])
Distance between 1st and 2nd documents: [[ 2.44948974]]
Distance between 1st and 3rd documents: [[ 2.64575131]]
Distance between 2nd and 3rd documents: [[ 2.64575131]]

Now let's assume that we are using a corpus of news articles instead of our toy corpus. Our dictionary may now contain hundreds of thousands of unique words instead of only ten. The feature vectors representing the articles will each have hundreds of thousands of elements, and many of the elements will be zero. Most sports articles will not contain any of the words that are particular to finance articles, and most culture articles will not contain any of the words that are particular to sports or finance articles. High-dimensional feature vectors that have many zero-valued elements are called sparse vectors.

Using high-dimensional data creates several problems for all machine learning tasks, including those that do not involve text. The first problem is that high-dimensional vectors require more memory than lower-dimensional vectors. SciPy provides sparse matrix types that mitigate this problem by storing only the nonzero elements of sparse vectors.
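
The following sketch shows this in practice; CountVectorizer's fit_transform returns a SciPy compressed sparse row (CSR) matrix, and the todense call in the earlier examples converts it to an ordinary dense matrix only for display:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]
X = CountVectorizer().fit_transform(corpus)

print(type(X))   # a scipy.sparse CSR matrix (the exact class name varies by version)
print(X.shape)   # (3, 10)
print(X.nnz)     # only the 12 nonzero counts are stored, out of 30 elements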

The second problem is known as the curse of dimensionality, or the Hughes effect. As the feature space's dimensionality increases, more training data is required to ensure that there are enough training instances for each combination of the features' values. If there are insufficient training instances for a feature, the algorithm may overfit noise in the training data and fail to generalize. In the following sections, we will review several strategies to reduce the dimensionality of text features. In Chapter 7, Dimensionality Reduction with PCA, we will review techniques for numerical dimensionality reduction.

Stop-word filtering

A basic strategy to reduce the dimensions of the feature space is to convert all of the text to lowercase. This is motivated by the insight that the letter case does not contribute to the meanings of most words; sandwich and Sandwich have the same meaning in most contexts. Capitalization may indicate that a word is at the beginning of a sentence, but the bag-of-words model has already discarded all information from word order and grammar.

A second strategy is to remove words that are common to most of the documents in the corpus. These words, called stop words, include determiners such as the, a, and an; auxiliary verbs such as do, be, and will; and prepositions such as on, around, and beneath. Stop words are often functional words that contribute to the document's meaning through grammar rather than through their denotations. The CountVectorizer class can filter stop words provided through the stop_words keyword argument, and it also includes a basic English stop list. Let's recreate the feature vectors for our documents using stop-word filtering:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
>>>     'UNC played Duke in basketball',
>>>     'Duke lost the basketball game',
>>>     'I ate a sandwich'
>>> ]
>>> vectorizer = CountVectorizer(stop_words='english')
>>> print vectorizer.fit_transform(corpus).todense()
>>> print vectorizer.vocabulary_
[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
{u'duke': 2, u'basketball': 1, u'lost': 4, u'played': 5, u'game': 3, u'sandwich': 6, u'unc': 7, u'ate': 0}

The feature vectors now have fewer dimensions, and the first two document vectors are still more similar to each other than they are to the third document's vector.
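
As a sanity check, a small sketch reusing euclidean_distances on the stop-word-filtered matrix shows that the first two documents are still closer to each other than either is to the third:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]
X = CountVectorizer(stop_words='english').fit_transform(corpus)

# Pairwise distances between all three filtered document vectors
print(euclidean_distances(X))
# The first two documents are separated by a distance of 2, while each is
# approximately 2.449 away from the third document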

Stemming and lemmatization

While stop-word filtering is an easy strategy for dimensionality reduction, most stop lists contain only a few hundred words. A large corpus may still have hundreds of thousands of unique words after filtering. Two similar strategies for further reducing dimensionality are called stemming and lemmatization.

A high-dimensional document vector may separately encode several derived or inflected forms of the same word. For example, jumping and jumps are both forms of the word jump; a document vector in a corpus of long-jumping articles may encode each inflected form with a separate element in the feature vector. Stemming and lemmatization are two strategies to condense inflected and derived forms of a word into a single feature.

Let's consider another toy corpus with two documents:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
>>>     'He ate the sandwiches',
>>>     'Every sandwich was eaten by him'
>>> ]
>>> vectorizer = CountVectorizer(binary=True, stop_words='english')
>>> print vectorizer.fit_transform(corpus).todense()
>>> print vectorizer.vocabulary_
[[1 0 0 1]
 [0 1 1 0]]
{u'sandwich': 2, u'ate': 0, u'sandwiches': 3, u'eaten': 1}

The documents have similar meanings, but their feature vectors have no elements in common. Both documents contain a form of the verb eat and an inflected form of sandwich. Ideally, these similarities should be reflected in the feature vectors. Lemmatization is the process of determining the lemma, or the morphological root, of an inflected word based on its context. Lemmas are the base forms of words that are used to key the word in a dictionary. Stemming has a similar goal to lemmatization, but it does not attempt to produce the morphological roots of words. Instead, stemming removes patterns of characters that appear to be affixes, resulting in a token that is not necessarily a valid word. Lemmatization frequently requires a lexical resource, such as WordNet, and the word's part of speech. Stemming algorithms frequently use rules instead of lexical resources to produce stems and can operate on any token, even without its context.

Let's consider lemmatization of the word gathering in two documents:

corpus = [
    'I am gathering ingredients for the sandwich.',
    'There were many wizards at the gathering.'
]

In the first sentence gathering is a verb, and its lemma is gather. In the second sentence gathering is a noun, and its lemma is gathering. We will use the Natural Language Toolkit (NLTK) to stem and lemmatize the corpus. NLTK can be installed using the instructions at http://www.nltk.org/install.html. After installation, execute the following code:

>>> import nltk
>>> nltk.download()

Then follow the instructions to download the corpora for NLTK.

Given the part of speech of gathering in each document, NLTK's WordNetLemmatizer correctly lemmatizes both uses, as shown in the following example:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> print lemmatizer.lemmatize('gathering', 'v')
>>> print lemmatizer.lemmatize('gathering', 'n')
gather
gathering

Let's compare lemmatization with stemming. The Porter stemmer cannot consider the inflected form's part of speech and returns gather for both documents:

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> print stemmer.stem('gathering')
gather

Now let's lemmatize our toy corpus:

>>> from nltk import word_tokenize
>>> from nltk.stem import PorterStemmer
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> from nltk import pos_tag
>>> wordnet_tags = ['n', 'v']
>>> corpus = [
>>>     'He ate the sandwiches',
>>>     'Every sandwich was eaten by him'
>>> ]
>>> stemmer = PorterStemmer()
>>> print 'Stemmed:', [[stemmer.stem(token) for token in word_tokenize(document)] for document in corpus]
>>> def lemmatize(token, tag):
>>>     if tag[0].lower() in wordnet_tags:
>>>         return lemmatizer.lemmatize(token, tag[0].lower())
>>>     return token
>>> lemmatizer = WordNetLemmatizer()
>>> tagged_corpus = [pos_tag(word_tokenize(document)) for document in corpus]
>>> print 'Lemmatized:', [[lemmatize(token, tag) for token, tag in document] for document in tagged_corpus]
Stemmed: [['He', 'ate', 'the', 'sandwich'], ['Everi', 'sandwich', 'wa', 'eaten', 'by', 'him']]
Lemmatized: [['He', 'eat', 'the', 'sandwich'], ['Every', 'sandwich', 'be', 'eat', 'by', 'him']]

Through stemming and lemmatization, we reduced the dimensionality of our feature space. We produced feature representations that more effectively encode the meanings of the documents despite the fact that the words in the corpus's dictionary are inflected differently in the sentences.
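
To see the effect on the feature space, the following sketch vectorizes the lemmatized documents from the previous example (joined back into strings by hand) with the same binary, stop-word-filtered CountVectorizer; the two vectors should now share all of their features:

from sklearn.feature_extraction.text import CountVectorizer

# The lemmatized documents from the previous example, joined back into strings
lemmatized_corpus = [
    'He eat the sandwich',
    'Every sandwich be eat by him'
]
vectorizer = CountVectorizer(binary=True, stop_words='english')
print(vectorizer.fit_transform(lemmatized_corpus).todense())
print(vectorizer.vocabulary_)
# With scikit-learn's English stop list, the remaining features should be
# eat and sandwich, so both documents map to the same vector, [1 1]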

Extending bag-of-words with TF-IDF weights

In the previous section we used the bag-of-words representation to create feature vectors that encode whether or not a word from the corpus's dictionary appears in a document. These feature vectors do not encode grammar, word order, or the frequencies of words. It is intuitive that the frequency with which a word appears in a document could indicate the extent to which a document pertains to that word. A long document that contains one occurrence of a word may discuss an entirely different topic than a document that contains many occurrences of the same word. In this section, we will create feature vectors that encode the frequencies of words, and discuss strategies to mitigate two problems caused by encoding term frequencies.

Instead of using a binary value for each element in the feature vector, we will now use an integer that represents the number of times that the corresponding word appeared in the document.

We will use the following corpus. With stop word filtering, the corpus is represented by the following feature vector:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ['The dog ate a sandwich, the wizard transfigured a sandwich, and I ate a sandwich']
>>> vectorizer = CountVectorizer(stop_words='english')
>>> print vectorizer.fit_transform(corpus).todense()
>>> print vectorizer.vocabulary_
[[2 1 3 1 1]]
{u'sandwich': 2, u'wizard': 4, u'dog': 1, u'transfigured': 3, u'ate': 0}

The element for dog is now set to 1 and the element for sandwich is set to 3 to indicate that the corresponding words occurred once and three times, respectively. Note that the binary keyword argument of CountVectorizer is omitted; its default value is False, which causes it to return raw term frequencies rather than binary frequencies. Encoding the terms' raw frequencies in the feature vector provides additional information about the meanings of the documents, but assumes that all of the documents are of similar lengths.

Many words might appear with the same frequency in two documents, but the documents could still be dissimilar if one document is many times larger than the other. scikit-learn's TfidfTransformer object can mitigate this problem by transforming a matrix of term frequency vectors into a matrix of normalized term frequency weights. By default, TfidfTransformer smoothes the raw counts and applies L2 normalization. The smoothed, normalized term frequencies are given by the following equation:

$$\operatorname{tf}(t, d) = \frac{f(t, d) + 1}{\left\| x \right\|}$$

Here, $f(t, d)$ is the frequency of term $t$ in document $d$, and $\left\| x \right\|$ is the L2 norm of the document's count vector. In addition to normalizing raw term counts, we can improve our feature vectors by calculating logarithmically scaled term frequencies, which scale the counts to a more limited range, or augmented term frequencies, which further mitigate the bias toward longer documents. Logarithmically scaled term frequencies are given by the following equation:

$$\operatorname{tf}(t, d) = \log\left(f(t, d) + 1\right)$$

The TfidfTransformer object calculates logarithmically scaled term frequencies when its sublinear_tf keyword argument is set to True. Augmented frequencies are given by the following equation:

$$\operatorname{tf}(t, d) = 0.5 + \frac{0.5 \cdot f(t, d)}{\max\left\{f(w, d) : w \in d\right\}}$$

Here, $\max\left\{f(w, d) : w \in d\right\}$ is the greatest frequency of any word in document $d$. scikit-learn 0.15.2 does not implement augmented term frequencies, but the output of CountVectorizer can be easily transformed, as shown in the following sketch.
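
For example, here is a minimal sketch that computes augmented term frequencies from CountVectorizer's output with NumPy, following the preceding equation; the corpus is the single-document corpus used earlier in this section:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The dog ate a sandwich, the wizard transfigured a sandwich, and I ate a sandwich']
counts = CountVectorizer(stop_words='english').fit_transform(corpus).toarray().astype(float)

# Augmented term frequency: 0.5 + 0.5 * f(t, d) / max f(w, d)
max_counts = counts.max(axis=1, keepdims=True)
augmented_tf = 0.5 + 0.5 * counts / max_counts
print(augmented_tf)
# [[ 0.83333333  0.66666667  1.          0.66666667  0.66666667]]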

Normalization, logarithmically scaled term frequencies, and augmented term frequencies can represent the frequencies of terms in a document while mitigating the effects of different document sizes. However, another problem remains with these representations. The feature vectors contain large weights for terms that occur frequently in a document, even if those terms occur frequently in most documents in the corpus. These terms do not help to represent the meaning of a particular document relative to the rest of the corpus. For example, most of the documents in a corpus of articles about Duke's basketball team could include the words basketball, Coach K, and flop. These words can be thought of as corpus-specific stop words and may not be useful to calculate the similarity of documents. The inverse document frequency (IDF) is a measure of how rare or common a word is in a corpus. The inverse document frequency is given by the following equation:

$$\operatorname{idf}(t, D) = \log\frac{N}{1 + \left|\left\{d \in D : t \in d\right\}\right|}$$

Here, $N$ is the total number of documents in the corpus, and $\left|\left\{d \in D : t \in d\right\}\right|$ is the number of documents in the corpus that contain the term $t$. A term's TF-IDF value is the product of its term frequency and inverse document frequency. TfidfTransformer returns TF-IDF weights when its use_idf keyword argument is set to its default value, True. Since TF-IDF-weighted feature vectors are commonly used to represent text, scikit-learn provides a TfidfVectorizer class that wraps CountVectorizer and TfidfTransformer. Let's use TfidfVectorizer to create TF-IDF-weighted feature vectors for our corpus:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
>>>     'The dog ate a sandwich and I ate a sandwich',
>>>     'The wizard transfigured a sandwich'
>>> ]
>>> vectorizer = TfidfVectorizer(stop_words='english')
>>> print vectorizer.fit_transform(corpus).todense()
[[ 0.75458397  0.37729199  0.53689271  0.          0.        ]
 [ 0.          0.          0.44943642  0.6316672   0.6316672 ]]

By comparing the TF-IDF weights to the raw term frequencies, we can see that words that are common to many of the documents in the corpus, such as sandwich, have been penalized.
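
To see where that penalty comes from, we can inspect the vectorizer's learned inverse document frequencies through its idf_ attribute; sandwich appears in both documents, so it should receive the smallest IDF:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(corpus)

print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# sandwich occurs in both documents, so its IDF is the smallest of the five terms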

Space-efficient feature vectorizing with the hashing trick

In this chapter's previous examples, a dictionary containing all of the corpus's unique tokens is used to map a document's tokens to the elements of a feature vector. Creating this dictionary has two drawbacks. First, two passes over the corpus are required: the first pass is used to create the dictionary and the second pass is used to create feature vectors for the documents. Second, the dictionary must be stored in memory, which could be prohibitive for large corpora. It is possible to avoid creating this dictionary by applying a hash function to each token to determine its index in the feature vector directly. This shortcut is called the hashing trick. The following example uses HashingVectorizer to demonstrate the hashing trick:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> corpus = ['the', 'ate', 'bacon', 'cat']
>>> vectorizer = HashingVectorizer(n_features=6)
>>> print vectorizer.transform(corpus).todense()
[[-1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0.  0.]]

The hashing trick is stateless. It can be used to create feature vectors in both parallel and online, or streaming, applications because it does not require an initial pass over the corpus. Note that n_features is an optional keyword argument. Its default value, 2^20, is adequate for most problems; it is set to 6 here so that the matrix will be small enough to print and still display all of the nonzero features. Also, note that some of the term frequencies are negative. Since hash collisions are possible, HashingVectorizer uses a signed hash function. The value of a feature takes the same sign as its token's hash; if the term cats appears twice in a document and is hashed to -3, the fourth element of the document's feature vector will be decremented by two. If the term dogs also appears twice and is hashed to 3, the fourth element of the feature vector will be incremented by two. Using a signed hash function creates the possibility that errors from hash collisions will cancel each other out rather than accumulate; losing some information is preferable to losing information and adding spurious information. Another disadvantage of the hashing trick is that the resulting model is more difficult to inspect, as the hash function cannot recall which input token maps to each element of the feature vector.
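
To make the mechanism concrete, the following is a toy, purely illustrative sketch of signed feature hashing. It uses Python's built-in hash function and a simple sign rule of our own rather than the MurmurHash function that HashingVectorizer uses, so its indices and signs will not match HashingVectorizer's output:

import numpy as np

def hashed_vector(tokens, n_features=6):
    # Build a signed, hashed count vector (illustrative only)
    vector = np.zeros(n_features)
    for token in tokens:
        hashed = hash(token)              # stand-in for MurmurHash3
        index = abs(hashed) % n_features  # column for this token
        sign = 1 if hashed >= 0 else -1   # sign is taken from the hash
        vector[index] += sign             # colliding tokens may cancel out
    return vector

print(hashed_vector('the cat ate the bacon'.split()))
# The exact indices and signs depend on the hash function and will vary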
