Converting raw text into a bag of words

We do not have to write custom code for counting words and representing those counts as a vector. Scikit's CountVectorizer class not only does the job efficiently, but also provides a very convenient interface:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)  

The min_df parameter determines how CountVectorizer treats rarely occurring words (minimum document frequency). If it is set to an integer, all words occurring in fewer documents than that will be dropped. If it is set to a fraction, all words that occur in less than that fraction of the overall dataset will be dropped. The max_df parameter works in a similar manner, but for dropping words that occur too frequently.
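To see the effect of min_df, here is a minimal sketch with three made-up subject lines; with min_df=2, every word that appears in only one of them is dropped:

>>> demo = ["cheap cheap offer", "cheap deal", "great deal"]
>>> CountVectorizer(min_df=2).fit(demo).get_feature_names()
['cheap', 'deal']

If we print the instance, we can see what other parameters scikit provides together with their default values: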

>>> print(vectorizer)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

We can see that, as expected, the counting is done at the word level (analyzer='word'), and that words are determined by the regular expression given in token_pattern. Its default matches sequences of two or more alphanumeric characters, so it will, for example, split cross-validated into cross and validated. This process is also called tokenization.
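We can verify this by asking the vectorizer for its analyzer, the callable that performs the tokenization (the subject line here is made up for illustration):

>>> analyze = vectorizer.build_analyzer()
>>> analyze("Cross-validated classifiers rock")
['cross', 'validated', 'classifiers', 'rock']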

Let's ignore the other parameters for now and consider the following two example subject lines:

>>> content = ["How to format my hard disk",
               "Hard disk format problems"]

We can now put this list of subject lines into the fit_transform() method of our vectorizer, which does all the hard vectorization work:

>>> X = vectorizer.fit_transform(content)
>>> vectorizer.get_feature_names()
['disk', 'format', 'hard', 'how', 'my', 'problems', 'to']

The vectorizer has detected seven words for which we can fetch the counts individually:

>>> print(X.toarray().transpose())
[[1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 0]]

This means that the first sentence contains all the words except problems, while the second contains all but how, my, and to. In fact, these are exactly the columns we saw in the preceding table. From X, we can extract a feature vector that we will use to compare two documents with each other.
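For instance, one way to pull the count vector of the first subject line out of the sparse matrix X is to index its first row:

>>> print(X[0].toarray())
[[1 1 1 1 1 0 1]]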

We will start with a naïve approach first, to point out some preprocessing peculiarities we have to account for. So let's pick a random post, for which we then create the count vector. We will then compute its distance to the count vectors of all the other posts and fetch the post with the smallest one.
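A minimal sketch of that idea could look as follows; the function name dist_raw and the choice of the Euclidean norm are ours here, not the only option:

>>> import numpy as np
>>> def dist_raw(v1, v2):
...     # Euclidean distance between two raw count vectors;
...     # the rows of X are sparse, so we densify them first
...     delta = v1 - v2
...     return np.linalg.norm(delta.toarray())
>>> dist_raw(X[0], X[1])
2.0

Our two example subject lines differ in exactly four words (how, my, and to appear only in the first, problems only in the second), which gives a distance of sqrt(4) = 2.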
