Representing the text as a bag of words

In this recipe, we will learn how to represent text as a bag of words.

Getting ready

In order to do machine learning on text, we need to convert the text to numerical feature vectors. In this section, we will look into the bag of words representation, where the text is converted to numerical vectors, the column names are the underlying words, and the values can be any of the following:

  • Binary, which indicates whether the word is present/absent in the given document
  • Frequency, which indicates the count of the word in the given document
  • TFIDF, which is a score that we will cover subsequently

Bag of words is the most common way of representing text. As the name suggests, the order of words is ignored; only the presence or absence of words is key to this representation. It is a two-step process, as follows:

  1. For every word that is present in the documents of the training set, we will assign an integer and store this mapping in a dictionary.
  2. For every document, we will create a vector. The columns of the vector are the actual words themselves; they form the features. The cell values are binary, frequency, or TFIDF values (a minimal hand-rolled sketch of these two steps follows this list).
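The following sketch illustrates these two steps by hand on a couple of made-up toy documents; in the rest of this recipe, scikit-learn's CountVectorizer does the same work for us:

# Toy documents (illustrative only)
docs = ["text mining is fun", "mining text data"]

# Step 1: assign an integer index to every word seen in the corpus
vocabulary = {}
for doc in docs:
    for word in doc.split():
        vocabulary.setdefault(word, len(vocabulary))

# Step 2: build one frequency vector per document
vectors = []
for doc in docs:
    vec = [0] * len(vocabulary)
    for word in doc.split():
        vec[vocabulary[word]] += 1
    vectors.append(vec)

print(vocabulary)
print(vectors)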

How to do it…

Let's load the necessary libraries and prepare the dataset for the demonstration of bag of words:

# Load Libraries
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# 1. Our input text; we use the same input that we used in the stop word removal recipe.
text = """Text mining, also referred to as text data mining, roughly equivalent to text analytics,
refers to the process of deriving high-quality information from text. High-quality information is 
typically derived through the devising of patterns and trends through means such as statistical 
pattern learning. Text mining usually involves the process of structuring the input text 
(usually parsing, along with the addition of some derived linguistic features and the removal 
of others, and subsequent insertion into a database), deriving patterns within the structured data, 
and finally evaluation and interpretation of the output. 'High quality' in text mining usually 
refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks 
include text categorization, text clustering, concept/entity extraction, production of granular 
taxonomies, sentiment analysis, document summarization, and entity relation modeling 
(i.e., learning relations between named entities).Text analysis involves information retrieval, 
lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, 
information extraction, data mining techniques including link and association analysis, 
visualization, and predictive analytics. The overarching goal is, essentially, to turn text 
into data for analysis, via application of natural language processing (NLP) and analytical 
methods.A typical application is to scan a set of documents written in a natural language and 
either model the document set for predictive classification purposes or populate a database 
or search index with the information extracted."""

Let's jump into how to transform the text into a bag of words representation:

# 2. Let us divide the given text into sentences
sentences = sent_tokenize(text)

# 3. Let us write the code to generate feature vectors.
count_v = CountVectorizer()
tdm = count_v.fit_transform(sentences)


# While creating a mapping from words to feature indices, we can ignore
# some words by providing a stop word list.
stop_words = stopwords.words('english')
count_v_sw = CountVectorizer(stop_words=stop_words)
sw_tdm = count_v_sw.fit_transform(sentences)


# Use n-grams
count_v_ngram = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2))
ngram_tdm = count_v_ngram.fit_transform(sentences)

How it works…

In step 1, we will define the input. This is the same input that we used for the stop word removal recipe. In step 2, we will import the sentence tokenizer and tokenize the given input into sentences. We will treat every sentence here as a document:

Tip

Depending on your application, the notion of a document can change. In this case, our sentence is considered as a document. In some cases, we can also treat a paragraph as a document. In web page mining, a single web page can be treated as a document or parts of the web page separated by the <p> tags can also be treated as a document.

>>> len(sentences)
6
>>>

If we print the length of the sentence list, we get six; so, in our case, we have six documents.

In step 3, we will use CountVectorizer, which we imported from the sklearn.feature_extraction.text module. It converts a collection of documents (in this case, a list of sentences) into a matrix, where the rows are sentences and the columns are the words in these sentences. The counts of these words are inserted as the values of these cells.

We will transform the list of sentences into a term-document matrix using CountVectorizer. Let's dissect the output one by one. First, we will look into count_v, which is a CountVectorizer object. We mentioned in the introduction that we need to build a dictionary of all the words in the given text. The vocabulary_ attribute of count_v provides us with the list of words and their associated IDs or feature indices:

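For instance, we can check the size of this dictionary and peek at a few of its entries; the exact index assigned to each word depends on the fitted vocabulary:

>>> len(count_v.vocabulary_)
122
>>> sorted(count_v.vocabulary_.items())[:3]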

This dictionary can be retrieved using the vocabulary_ attribute; it is a mapping of terms to feature indices. We can also use the following function to get the list of words (features):

>>> count_v.get_feature_names()

Let's now move on to look at tdm, which is the object that we received after transforming the given input using CountVectorizer:

>>> type(tdm)
<class 'scipy.sparse.csr.csr_matrix'>
>>>

As you can see, tdm is a sparse matrix object. Refer to the following link to understand more about the sparse matrix representation:

http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html

We can look into the shape of this object and also inspect some of the elements, as follows:

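The following commands perform this inspection; the shape follows from our six sentences and the vocabulary size, while the exact contents of indptr, data, and indices depend on the fitted vocabulary:

>>> tdm.shape
(6, 122)
>>> tdm.indptr
>>> tdm.data
>>> tdm.indices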

We can see that the shape of the matrix is 6 x 122. We have six documents, that is, sentences in our context, and 122 words that form the vocabulary. Note that this is a sparse matrix representation; since not every sentence contains every word, a lot of the cell values will be zero, and hence we will print only the indices that have non-zero entries.

From tdm.indptr, we know that the first document's entries start at position 0 and end at position 18 of the tdm.data and tdm.indices arrays, as follows:

>>> tdm.data[0:17]
array([4, 2, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
>>> tdm.indices[0:17]
array([107,  60,   2,  83, 110,   9,  17,  90,  28,   5,  84, 108,  77,
        67,  20,  40,  81])
>>>

We can verify this in the following way:

>>> count_v.get_feature_names()[107]
u'text'
>>> count_v.get_feature_names()[60]
u'mining'

We can see that 107, which corresponds to the word text, has occurred four times in the first sentence, and similarly, mining (index 60) has occurred twice. Thus, in this recipe, we converted a given text into a feature vector, where the features are words.

There's more…

The CountVectorizer class offers a lot of other options for transforming text into feature vectors. Let's look at some of them:

>>> count_v.get_params()
{'binary': False, 'lowercase': True, 'stop_words': None, 'vocabulary': None, 'tokenizer': None, 'decode_error': u'strict', 'dtype': <type 'numpy.int64'>, 'charset_error': None, 'charset': None, 'analyzer': u'word', 'encoding': u'utf-8', 'ngram_range': (1, 1), 'max_df': 1.0, 'min_df': 1, 'max_features': None, 'input': u'content', 'strip_accents': None, 'token_pattern': u'(?u)\b\w\w+\b', 'preprocessor': None}
>>>	

The first one is binary, which is set to False by default; we can also set it to True. Then, the final matrix will not hold counts, but ones and zeros, based on the presence or absence of a word in the document.
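For example, a binary term-document matrix for our sentences can be built as follows (a small sketch; the variable names are ours):

count_v_binary = CountVectorizer(binary=True)
binary_tdm = count_v_binary.fit_transform(sentences)
# Every non-zero cell is now 1, irrespective of how often the word occurs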

The lowercase parameter is set to True by default; the input text is converted to lowercase before the words are mapped to feature indices.
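If we want to preserve case, for example, to keep an acronym such as NLP distinct from ordinary words, we can turn this off (again, a small illustrative sketch):

count_v_case = CountVectorizer(lowercase=False)
case_tdm = count_v_case.fit_transform(sentences)
# 'Text' and 'text' now map to different feature indices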

While creating a mapping of the words to feature indices, we can ignore some words by providing a stop word list. Observe the following example:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

count_v_sw = CountVectorizer(stop_words=stop_words)
sw_tdm = count_v_sw.fit_transform(sentences)

If we print the size of the vocabulary that has been built, we can see the following:

>>> len(count_v_sw.vocabulary_)
106
>>>

We can see that we now have 106 features, compared to the 122 that we had before.

We can also give CountVectorizer a fixed vocabulary. The columns of the final sparse matrix will come only from this fixed set, and any word that is not in the set will be ignored.
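As a small sketch, suppose we restrict the vocabulary to three words that occur in our input (the choice of words here is purely illustrative):

fixed_vocab = ['text', 'mining', 'analysis']
count_v_fixed = CountVectorizer(vocabulary=fixed_vocab)
fixed_tdm = count_v_fixed.fit_transform(sentences)

# The matrix has exactly one column per word in the fixed vocabulary
print(fixed_tdm.shape)   # (6, 3)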

The next interesting parameter is ngram_range. You can see that the tuple (1, 1) has been passed. This ensures that only unigrams, that is, single words, are used while creating the feature set. For example, this can be changed to (1, 2), which tells CountVectorizer to create both unigrams and bigrams. Let's look at the following code and the output:

count_v_ngram = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2))
ngram_tdm = count_v_ngram.fit_transform(sentences)

Both the unigrams and bigrams are now a part of our feature set.
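We can verify this by filtering the feature names for entries that contain a space, which is how CountVectorizer joins the words of an n-gram:

bigrams = [feature for feature in count_v_ngram.get_feature_names() if ' ' in feature]
print(len(bigrams))    # number of bigram features
print(bigrams[:5])     # a few sample bigrams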

I will leave you to explore the other parameters. The documentation for these parameters is available at the following link:

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

See also

  • Using Dictionaries recipe in Chapter 1, Using Python for Data Science
  • Removing Stop words, Stemming of words, Lemmatization of words recipe in Chapter 3, Analyzing Data - Explore & Wrangle