Let's play with the toy dataset, consisting of the following posts:

Post filename | Post content
01.txt | This is a toy post about machine learning. Actually, it contains not much interesting stuff.
02.txt | Imaging databases provide storage capabilities.
03.txt | Most imaging databases save images permanently.
04.txt | Imaging databases store data.
05.txt | Imaging databases store data. Imaging databases store data. Imaging databases store data.
In this post dataset, we want to find the post most similar to the short post "imaging databases".
Assuming that the posts are located in the "data/toy" directory (please check the Jupyter notebook), we can feed their contents to CountVectorizer:
>>> from pathlib import Path  # for easy path management
>>> TOY_DIR = Path('data/toy')
>>> posts = []
>>> for fn in sorted(TOY_DIR.iterdir()):  # sort so post order matches filenames
...     with open(fn, 'r') as f:
...         posts.append(f.read())
...
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=1)
We have to notify the vectorizer about the full dataset so that it knows upfront which words are to be expected:
>>> X_train = vectorizer.fit_transform(posts)
>>> num_samples, num_features = X_train.shape
>>> print("#samples: %d, #features: %d" %
...       (num_samples, num_features))
#samples: 5, #features: 25
Unsurprisingly, we have five posts with a total of 25 different words. The following tokenized words will be counted:
>>> print(vectorizer.get_feature_names())
['about', 'actually', 'capabilities', 'contains', 'data', 'databases', 'images', 'imaging', 'interesting', 'is', 'it', 'learning', 'machine', 'most', 'much', 'not', 'permanently', 'post', 'provide', 'save', 'storage', 'store', 'stuff', 'this', 'toy']
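Note that the feature names are sorted alphabetically, and that ordering determines which column of the count vectors each word occupies. The vectorizer's vocabulary_ attribute exposes this word-to-column mapping; here is a small standalone sketch on a one-post corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on a single tiny post to keep the vocabulary small
vectorizer = CountVectorizer(min_df=1)
vectorizer.fit(["imaging databases store data"])

# vocabulary_ maps each token to its column index in the count vectors;
# the indices follow the alphabetical order of the tokens
print(vectorizer.vocabulary_)
```

So 'data' gets column 0, 'databases' column 1, 'imaging' column 2, and 'store' column 3, regardless of the order in which the words appeared in the post.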
Now we can vectorize our new post:
>>> new_post = "imaging databases"
>>> new_post_vec = vectorizer.transform([new_post])
Note that the count vectors returned by the transform method are sparse, which is the appropriate format because the data itself is also sparse. That is, each vector does not store one count value for every word, as most of those counts would be zero (the post does not contain the word). Instead, it uses the more memory-efficient implementation csr_matrix (for compressed sparse row). Our new post, for instance, actually contains only two elements:
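To get a feeling for how sparse such matrices are, we can compare the number of explicitly stored non-zero entries (the nnz attribute) with the full matrix size; a small standalone sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

posts = ["imaging databases store data",
         "most imaging databases save images"]
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(posts)

# Only the non-zero counts are stored explicitly
print("%d non-zeros out of %d cells" % (X.nnz, X.shape[0] * X.shape[1]))
# → 9 non-zeros out of 14 cells
```

Even in this tiny example a third of the cells are zero; with a realistic vocabulary of tens of thousands of words, almost all of them are.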
>>> print(new_post_vec)
  (0, 5)	1
  (0, 7)	1
Via its toarray() member, we can once again access the full ndarray:
>>> print(new_post_vec.toarray())
[[0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
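One detail worth knowing: transform only counts words that the vectorizer saw during fitting; unseen words are silently dropped. A standalone sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
vectorizer.fit(["imaging databases store data"])

# 'huge' never occurred during fitting, so it simply vanishes;
# the columns are data, databases, imaging, store (alphabetical)
vec = vectorizer.transform(["huge imaging databases"])
print(vec.toarray())  # → [[0 1 1 0]]
```

This is exactly why we had to fit the vectorizer on the full dataset first: any word missing from the training vocabulary cannot contribute to a count vector later.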
We need to use the full array if we want to use it as a vector for similarity calculations. For the similarity measurement (the naïve one), we calculate the Euclidean distance between the count vectors of the new post and all the old posts:
import scipy.linalg

def dist_raw(v1, v2):
    # Euclidean distance between two sparse count vectors
    delta = v1 - v2
    return scipy.linalg.norm(delta.toarray())
The norm() function calculates the Euclidean norm (the shortest distance). This is just one obvious first pick; there are many more interesting ways to calculate distances. Just take a look at the paper "Distance Coefficients between Two Lists or Sets" in The Python Papers Source Codes, in which Maurice Ling nicely presents 35 different ones.
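As one popular alternative from that family, here is a sketch of the cosine distance (the function name dist_cosine is our own illustrative choice, not part of the book's code); unlike the Euclidean distance, it depends only on the angle between the vectors, not on their lengths, which will matter shortly:

```python
import numpy as np
from scipy.sparse import csr_matrix

def dist_cosine(v1, v2):
    # cosine distance = 1 - cosine similarity; it measures the angle
    # between the vectors and ignores their magnitudes
    v1 = v1.toarray().ravel()
    v2 = v2.toarray().ravel()
    return 1.0 - np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = csr_matrix([[1, 1, 0]])
b = csr_matrix([[3, 3, 0]])  # same direction, three times the length
print(dist_cosine(a, b))     # ≈ 0.0 (up to floating-point error)
```

Because a and b point in the same direction, their cosine distance is zero even though their Euclidean distance is not.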
With dist_raw, we just need to iterate over all the posts and remember the nearest one. As we will play with it throughout the book, let's define a convenience function that takes the current dataset and the new post in vectorized form as well as a distance function and prints out an analysis of how well the distance function works:
def best_post(X, new_vec, dist_func):
    best_dist = float('inf')  # start with infinity so any real distance is smaller
    best_i = None
    for i, post in enumerate(posts):
        if post == new_post:
            continue  # skip the new post itself if it is part of the dataset
        post_vec = X.getrow(i)
        d = dist_func(post_vec, new_vec)
        print("=== Post %i with dist=%.2f:\n    '%s'" % (i, d, post))
        if d < best_dist:
            best_dist = d
            best_i = i
    print("\n==> Best post is %i with dist=%.2f" % (best_i, best_dist))
When we execute best_post(X_train, new_post_vec, dist_raw), we can see in the output the posts with their respective distances to the new post:
=== Post 0 with dist=4.00:
    'This is a toy post about machine learning. Actually, it contains not much interesting stuff.'
=== Post 1 with dist=1.73:
    'Imaging databases provide storage capabilities.'
=== Post 2 with dist=2.00:
    'Most imaging databases save images permanently.'
=== Post 3 with dist=1.41:
    'Imaging databases store data.'
=== Post 4 with dist=5.10:
    'Imaging databases store data. Imaging databases store data. Imaging databases store data.'

==> Best post is 3 with dist=1.41
Congratulations, we have our first similarity measurement. Post 0 is the most dissimilar from our new post; quite understandably, it does not have a single word in common with it. We can also understand that Post 1 is very similar to the new post but not the winner, as it contains one more word than Post 3 that does not occur in the new post.
Looking at Post 3 and Post 4, however, the picture is not so clear. Post 4 is simply Post 3 repeated three times. So, it should be exactly as similar to the new post as Post 3.
Printing the corresponding feature vectors explains why:
>>> print(X_train.getrow(3).toarray())
[[0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]]
>>> print(X_train.getrow(4).toarray())
[[0 0 0 0 3 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0]]
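We can check these raw distances by hand, restricting ourselves to the four columns that are non-zero anywhere ('data', 'databases', 'imaging', 'store'); a quick standalone sketch:

```python
import numpy as np

# only the columns for 'data', 'databases', 'imaging', 'store'
post3 = np.array([1, 1, 1, 1])
post4 = np.array([3, 3, 3, 3])   # Post 3 tripled
new   = np.array([0, 1, 1, 0])   # "imaging databases"

print(np.linalg.norm(post3 - new))  # sqrt(2)  ≈ 1.41
print(np.linalg.norm(post4 - new))  # sqrt(26) ≈ 5.10
```

Tripling the counts blows the difference vector up from (1, 0, 0, 1) to (3, 2, 2, 3), so the raw Euclidean distance punishes the longer post even though its content is identical.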
Obviously, using only the counts of the raw words is insufficient. We will have to normalize them to get vectors of unit length.
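One way to do that normalization, sketched here as dist_norm (our own illustrative name), divides each count vector by its Euclidean norm before taking the distance:

```python
import scipy.linalg
from scipy.sparse import csr_matrix

def dist_norm(v1, v2):
    # scale each count vector to unit length, then take the Euclidean distance
    v1_arr = v1.toarray().ravel()
    v2_arr = v2.toarray().ravel()
    v1_arr = v1_arr / scipy.linalg.norm(v1_arr)
    v2_arr = v2_arr / scipy.linalg.norm(v2_arr)
    return scipy.linalg.norm(v1_arr - v2_arr)

new   = csr_matrix([[0, 1, 1, 0]])
post3 = csr_matrix([[1, 1, 1, 1]])
post4 = csr_matrix([[3, 3, 3, 3]])  # Post 3 tripled

# After normalization, the tripled post is exactly as close as the original
print(dist_norm(post3, new), dist_norm(post4, new))
```

Since (3, 3, 3, 3) normalized is the same unit vector as (1, 1, 1, 1) normalized, both posts now end up at the same distance from the new post, which is what we wanted.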