In this recipe, we will learn how to represent text as a bag of words.
In order to do machine learning on text, we need to convert the text to numerical feature vectors. In this section, we will look into the bag of words representation, where the text is converted to numerical vectors: the column names are the underlying words, and the values are either the count of the word in the document or a binary indicator of its presence.
Bag of words is the most common way of representing text. As the name suggests, the order of words is ignored; only the presence or absence of words is key to this representation. It is a two-step process: first, a vocabulary of all the words in the given text is built, and second, each document is encoded as a vector of word counts over that vocabulary.
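The two steps above can be sketched in plain Python, without any library, on a toy corpus (the two documents here are made up purely for illustration):

```python
from collections import Counter

docs = ["text mining is fun", "mining text data"]

# Step 1: build a vocabulary mapping each unique word to a feature index
vocab = {}
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab[word] = len(vocab)

# Step 2: encode each document as a vector of word counts over the vocabulary
index_order = sorted(vocab, key=vocab.get)
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts.get(word, 0) for word in index_order])

print(vocab)    # {'text': 0, 'mining': 1, 'is': 2, 'fun': 3, 'data': 4}
print(vectors)  # [[1, 1, 1, 1, 0], [1, 1, 0, 0, 1]]
```

Note how the word order within each document is lost; only the counts survive. This is exactly what CountVectorizer automates for us below.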
Let's load the necessary libraries and prepare the dataset for the demonstration of bag of words:
# Load libraries
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# 1. Our input text; we use the same input that we used in the stop word removal recipe.
text = "Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted."
Let's jump into how to transform the text into a bag of words representation:
# 2. Let us divide the given text into sentences
sentences = sent_tokenize(text)

# 3. Let us write the code to generate feature vectors
count_v = CountVectorizer()
tdm = count_v.fit_transform(sentences)

# While creating a mapping from words to feature indices, we can ignore
# some words by providing a stop word list
stop_words = stopwords.words('english')
count_v_sw = CountVectorizer(stop_words=stop_words)
sw_tdm = count_v_sw.fit_transform(sentences)

# Use ngrams
count_v_ngram = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2))
ngram_tdm = count_v_ngram.fit_transform(sentences)
In step 1, we will define the input. This is the same input that we used for the stop word removal recipe. In step 2, we will import the sentence tokenizer and tokenize the given input into sentences. We will treat every sentence here as a document:
Depending on your application, the notion of a document can change. In this case, our sentence is considered as a document. In some cases, we can also treat a paragraph as a document. In web page mining, a single web page can be treated as a document or parts of the web page separated by the <p> tags can also be treated as a document.
>>> len(sentences)
6
>>>
If we print the length of the sentence list, we will get six, and so in our case, we have six documents.
In step 3, we will import CountVectorizer from the sklearn.feature_extraction.text module. It converts a collection of documents, in this case a list of sentences, to a matrix, where the rows are sentences and the columns are the words in these sentences. The count of each word is inserted in the value of the corresponding cell.
We will transform the list of sentences into a term-document matrix using CountVectorizer. Let's dissect the output one by one. First, we will look into count_v, which is a CountVectorizer object. We mentioned in the introduction that we need to build a dictionary of all the words in the given text. The vocabulary_ attribute of count_v provides us with this dictionary: it is a mapping from the terms to their feature indices. We can also use the following function to get the list of words (features):
>>> count_v.get_feature_names()
Let's now move on to look at tdm, which is the object that we received after transforming the given input using CountVectorizer:
>>> type(tdm)
<class 'scipy.sparse.csr.csr_matrix'>
>>>
As you can see, tdm is a sparse matrix object. Refer to the following link to understand more about the sparse matrix representation:
http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html
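To see how the CSR attributes that we inspect next fit together, here is a minimal sketch on a tiny hand-made matrix (assuming scipy is installed):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny dense term-document matrix: 2 documents x 4 terms
dense = np.array([[2, 0, 1, 0],
                  [0, 3, 0, 1]])
sparse = csr_matrix(dense)

print(sparse.data)     # the non-zero values, row by row: [2 1 3 1]
print(sparse.indices)  # the column index of each non-zero value: [0 2 1 3]
print(sparse.indptr)   # row i's entries span data[indptr[i]:indptr[i+1]]: [0 2 4]
```

So document 0's entries occupy positions indptr[0] to indptr[1] of data and indices, which is exactly the lookup we perform on tdm below.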
We can look into the shape of this object and also inspect some of the elements, as follows:
We can see that the shape of the matrix is 6 x 122. We have six documents, that is, sentences in our context, and 122 words that form the vocabulary. Note that this is a sparse matrix representation; since not all sentences contain all the words, many of the cell values will be zero, and hence only the indices that have non-zero entries are stored and printed.
From tdm.indptr, we know that document 1's entry starts from 0 and ends at 18 in the tdm.data and tdm.indices arrays, as follows:
>>> tdm.data[0:17]
array([4, 2, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
>>> tdm.indices[0:17]
array([107, 60, 2, 83, 110, 9, 17, 90, 28, 5, 84, 108, 77, 67, 20, 40, 81])
>>>
We can verify this in the following way:
>>> count_v.get_feature_names()[107]
u'text'
>>> count_v.get_feature_names()[60]
u'mining'
We can see that feature index 107, which corresponds to the word text, has occurred four times in the first sentence, and similarly, mining has occurred twice. Thus, in this recipe, we converted a given text into a feature vector, where the features are words.
The CountVectorizer class has a lot of other parameters on offer to transform the text into feature vectors. Let's look at some of them:
>>> count_v.get_params()
{'binary': False, 'lowercase': True, 'stop_words': None, 'vocabulary': None, 'tokenizer': None, 'decode_error': u'strict', 'dtype': <type 'numpy.int64'>, 'charset_error': None, 'charset': None, 'analyzer': u'word', 'encoding': u'utf-8', 'ngram_range': (1, 1), 'max_df': 1.0, 'min_df': 1, 'max_features': None, 'input': u'content', 'strip_accents': None, 'token_pattern': u'(?u)\b\w\w+\b', 'preprocessor': None}
>>>
The first parameter is binary, which is set to False by default; we can also set it to True. Then, the final matrix will not contain counts but ones and zeros, based on the presence or absence of the word in the document.
The lowercase parameter is set to True by default; the input text is transformed into lowercase before the words are mapped to feature indices.
While creating a mapping of the words to feature indices, we can ignore some words by providing a stop word list. Observe the following example:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
count_v_sw = CountVectorizer(stop_words=stop_words)
sw_tdm = count_v_sw.fit_transform(sentences)
If we print the size of the vocabulary that has been built, we can see the following:
>>> len(count_v_sw.vocabulary_)
106
>>>
We can see that we have 106 features now, as compared to the 122 that we had before.
We can also give a fixed vocabulary to CountVectorizer. The final sparse matrix's columns will come only from this fixed set, and any word that is not in this set will be ignored.
The next interesting parameter is ngram_range. You can see that a tuple (1, 1) has been passed. This ensures that only unigrams, or single words, are used while creating the feature set. It can be changed, for example, to (1, 2), which tells CountVectorizer to create both unigrams and bigrams. Let's look at the following code and the output:
count_v_ngram = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2))
ngram_tdm = count_v_ngram.fit_transform(sentences)
Both the unigrams and bigrams are now a part of our feature set.
I will leave you to explore the other parameters. The documentation for these parameters is available at the following link: