Text transformers

Now that we have our dataset, how are we going to perform data mining on it?

Text-based datasets include books, essays, websites, manuscripts, programming code, and other forms of written expression. All of the algorithms we have seen so far deal with numerical or categorical features, so how do we convert our text into a format that the algorithm can deal with?

There are a number of measurements that could be taken. For instance, average word length and average sentence length are used to predict the readability of a document. However, there are many other feature types, such as word occurrence, which we will now investigate.

Bag-of-words

One of the simplest but highly effective models is to simply count each word in the dataset. We create a matrix, where each row represents a document in our dataset and each column represents a word. The value of the cell is the frequency of that word in the document.

Here's an excerpt from The Lord of the Rings by J.R.R. Tolkien:

Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie.

 --J.R.R. Tolkien's epigraph to The Lord of the Rings

The word the appears nine times in this quote, while the words in, for, to, and one each appear four times. The word ring appears three times, as does the word of.

We can create a dataset from this, choosing a subset of words and counting the frequency:

Word        the    one    ring    to
Frequency   9      4      3       4

We can use the Counter class from Python's collections module to do a simple count for a given string. When counting words, it is normal to convert all letters to lowercase, which we do when creating the string. The code is as follows:

s = """Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in halls of stone,
Nine for Mortal Men, doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.
In the Land of Mordor where the Shadows lie. """.lower()
words = s.split()
from collections import Counter
c = Counter(words)

Printing c.most_common(5) gives the list of the top five most frequently occurring words. Be aware that ties are not handled specially: exactly five entries are returned, and if several words share the count at the cut-off, the tie is broken by the order in which the words were first encountered.
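For example, running the preceding code on the excerpt should print something like the following (the order of the tied words depends on where they first appear in the text):

print(c.most_common(5))
# [('the', 9), ('for', 4), ('in', 4), ('to', 4), ('one', 4)]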

The bag-of-words model has three major types. The first is to use the raw frequencies, as shown in the preceding example. This has a drawback when documents vary widely in length, as the overall values will be very different for short and long documents. The second type is to use normalized frequencies, where each document's values sum to 1. This is a much better solution, as the length of the document matters much less. The third type is to simply use binary features: a value is 1 if the word occurs at all and 0 if it doesn't. We will use the binary representation in this chapter.
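To make the three representations concrete, here is a minimal sketch that derives all three from the Counter we built above (the variable names are just for illustration):

# Raw frequencies: the counts themselves
raw = dict(c)

# Normalized frequencies: divide by the total so each document's values sum to 1
total = sum(c.values())
normalized = {word: count / total for word, count in c.items()}

# Binary features: 1 if the word occurs at all, 0 otherwise
binary = {word: 1 for word in c}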

Another popular (arguably more popular) method for weighting terms is called term frequency-inverse document frequency, or tf-idf. In this weighting scheme, term counts are first normalized to frequencies and then weighted down according to how many documents in the corpus the term appears in, typically by multiplying by the logarithm of the inverse document frequency. This way, a term that appears in nearly every document ends up with a weight near zero. We will use tf-idf in Chapter 10, Clustering News Articles.
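As a minimal sketch of one common tf-idf variant (there are several in practice; scikit-learn's TfidfVectorizer, for instance, uses a smoothed formula), using a made-up three-document corpus:

import math

corpus = ["the ring is one ring", "one ring to rule them all", "the land of mordor"]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

def tf_idf(word, doc_tokens):
    # Term frequency: proportion of this document made up of the word
    tf = doc_tokens.count(word) / len(doc_tokens)
    # Document frequency: how many documents contain the word at all
    df = sum(1 for tokens in tokenized if word in tokens)
    # The log scales down words that are common across the corpus
    return tf * math.log(n_docs / df)

print(tf_idf("ring", tokenized[0]))  # "ring" occurs in 2 of the 3 documents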

There are a number of libraries for working with text data in Python. We will use a major one, called the Natural Language Toolkit (NLTK). The scikit-learn library also has the CountVectorizer class, which performs a similar job, and it is recommended you take a look at it (we will use it in Chapter 9, Authorship Attribution). However, the NLTK version has more options for word tokenization. If you are doing natural language processing in Python, NLTK is a great library to use.
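As a quick taste of the scikit-learn approach, here is a sketch using CountVectorizer on a made-up two-document corpus (get_feature_names_out is the method name in recent scikit-learn versions):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["One Ring to rule them all", "One Ring to find them"]
vectorizer = CountVectorizer()        # lowercases and tokenizes for us
X = vectorizer.fit_transform(corpus)  # sparse document-by-word count matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())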

N-grams

A step up from single bag-of-words features is that of n-grams. An n-gram is a subsequence of n consecutive tokens. In this context, a word n-gram is a sequence of n words that appear in a row.

They are counted in the same way as single words, with each n-gram treated as a word that is put into the bag. The value of a cell in this dataset is the frequency with which a particular n-gram appears in the given document.

Note

The value of n is a parameter. For English, setting it to between 2 and 5 is a good start, although some applications call for higher values.

As an example, for n=3, we extract the first few n-grams in the following quote:

Always look on the bright side of life.

The first n-gram (of size 3) is Always look on, the second is look on the, the third is on the bright. As you can see, consecutive n-grams overlap, with each covering three words.
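A minimal sketch of extracting word n-grams in plain Python (NLTK also provides a helper for this, nltk.ngrams):

sentence = "Always look on the bright side of life"
tokens = sentence.lower().split()

def word_ngrams(tokens, n):
    # Zip the token list against shifted copies of itself
    return [" ".join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]

print(word_ngrams(tokens, 3))
# ['always look on', 'look on the', 'on the bright', ...]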

Word n-grams have advantages over single words. This simple concept introduces some context to each word's use by considering its local environment, without the large computational overhead of understanding the language. A disadvantage of using n-grams is that the matrix becomes even sparser: a given word n-gram is unlikely to appear twice. This is especially true for social media and other short documents, where a word n-gram is unlikely to appear in many different tweets unless one is a retweet of the other. In larger documents, however, word n-grams are quite effective for many applications.

Another form of n-gram for text documents is the character n-gram. Rather than using sequences of words, we simply use sequences of characters (although there are many options for how character n-grams can be computed!). This type of dataset can pick up words that are misspelled, as well as providing other benefits. We will test character n-grams in this chapter and see them again in Chapter 9, Authorship Attribution.
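One simple option is to slide a fixed-size window over the raw string, one character at a time (a sketch; here we keep every character, although some schemes strip spaces or punctuation). Note how a misspelling still shares most of its n-grams with the correct spelling:

def char_ngrams(text, n):
    # Every contiguous run of n characters in the string
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("separate", 3))  # ['sep', 'epa', 'par', 'ara', 'rat', 'ate']
print(char_ngrams("seperate", 3))  # ['sep', 'epe', 'per', 'era', 'rat', 'ate']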

Other features

There are other features that can be extracted too. These include syntactic features, such as the usage of particular words in sentences. Part-of-speech tags are also popular for data mining applications that need to understand meaning in text. Such feature types won't be covered in this module. If you are interested in learning more, I recommend Python 3 Text Processing with NLTK 3 Cookbook, Jacob Perkins, Packt Publishing.
