We will have to extend dist_raw to calculate the vector distance not on the raw vectors but on the normalized ones instead:
def dist_norm(v1, v2):
    v1_normalized = v1 / scipy.linalg.norm(v1.toarray())
    v2_normalized = v2 / scipy.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return scipy.linalg.norm(delta.toarray())
Executing best_post(X_train, new_post_vec, dist_norm) now yields the following similarity measurements:
=== Post 0 with dist=1.41: 'This is a toy post about machine learning. Actually, it contains not much interesting stuff.'
=== Post 1 with dist=0.86: 'Imaging databases provide storage capabilities.'
=== Post 2 with dist=0.92: 'Most imaging databases save images permanently. '
=== Post 3 with dist=0.77: 'Imaging databases store data.'
=== Post 4 with dist=0.77: 'Imaging databases store data. Imaging databases store data. Imaging databases store data.'
==> Best post is 3 with dist=0.77
This looks better now. Post 3 and Post 4 are calculated as being equally similar. One could argue whether that much repetition is a delight to the reader, but in terms of counting the words in the posts, this seems right.
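To see why normalization makes the repeated post and the original equally similar, consider a minimal sketch on dense NumPy arrays (the hypothetical dist_norm_dense below mirrors dist_norm, just without the sparse .toarray() calls):

```python
import numpy as np
import scipy.linalg

def dist_norm_dense(v1, v2):
    # Same idea as dist_norm, but on dense arrays for illustration:
    # scale each vector to unit length before taking the distance.
    v1_normalized = v1 / scipy.linalg.norm(v1)
    v2_normalized = v2 / scipy.linalg.norm(v2)
    return scipy.linalg.norm(v1_normalized - v2_normalized)

v = np.array([1.0, 2.0, 0.0])   # word counts of a post
v3 = 3 * v                      # the same post repeated three times

# Tripling every count does not change the direction of the vector,
# so after normalization the distance is (numerically) zero.
print(dist_norm_dense(v, v3))
```

Since 3*v points in the same direction as v, both normalize to the same unit vector, and the distance collapses to zero (up to floating-point rounding) regardless of how often the post is repeated.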