Normalizing word count vectors

We have to extend dist_raw so that it calculates the vector distance not on the raw count vectors but on the normalized ones instead:

import scipy.linalg

def dist_norm(v1, v2):
    # Scale each count vector to unit length so that document length
    # no longer influences the distance
    v1_normalized = v1 / scipy.linalg.norm(v1.toarray())
    v2_normalized = v2 / scipy.linalg.norm(v2.toarray())
    # Euclidean distance between the normalized vectors
    delta = v1_normalized - v2_normalized
    return scipy.linalg.norm(delta.toarray())
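
The helper best_post used next is built step by step earlier in the chapter; for reference, here is a minimal sketch of how it might look. The global posts list of training posts is an assumption carried over from that earlier code:

import sys

def best_post(X_train, new_post_vec, dist_func):
    # Sketch of the assumed helper: find the training post whose
    # vector is closest to the new post's vector under dist_func
    best_dist = sys.maxsize
    best_i = None
    for i in range(X_train.shape[0]):
        post_vec = X_train.getrow(i)
        d = dist_func(post_vec, new_post_vec)
        # 'posts' is assumed to hold the raw training posts (defined earlier)
        print("=== Post %i with dist=%.2f:\n    %r" % (i, d, posts[i]))
        if d < best_dist:
            best_dist = d
            best_i = i
    print("==> Best post is %i with dist=%.2f" % (best_i, best_dist))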

This leads to the following similarity measurement when executed with best_post(X_train, new_post_vec, dist_norm):

    === Post 0 with dist=1.41:
        'This is a toy post about machine learning. Actually, it contains not much interesting stuff.'
    === Post 1 with dist=0.86:
        'Imaging databases provide storage capabilities.'
    === Post 2 with dist=0.92:
        'Most imaging databases save images permanently.
    '
    === Post 3 with dist=0.77:
        'Imaging databases store data.'
    === Post 4 with dist=0.77:
        'Imaging databases store data. Imaging databases store data. Imaging databases store data.'
    
    ==> Best post is 3 with dist=0.77

This looks a bit better now: Post 3 and Post 4 are calculated as being equally similar to the new post. One could argue whether that much repetition would be a delight to the reader, but in terms of counting the words in the posts, this seems right.
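
A quick way to convince yourself of this is to normalize the count vectors of the two posts directly. The following snippet is a hypothetical check, assuming the default CountVectorizer setup used in this chapter; since Post 4 is just Post 3 repeated three times, the normalized vectors coincide:

import scipy.linalg
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform([
    "Imaging databases store data.",
    "Imaging databases store data. Imaging databases store data. "
    "Imaging databases store data.",
])
v3 = X.getrow(0).toarray()
v4 = X.getrow(1).toarray()
# After scaling to unit length, both vectors are identical
v3_normalized = v3 / scipy.linalg.norm(v3)
v4_normalized = v4 / scipy.linalg.norm(v4)
print(scipy.linalg.norm(v3_normalized - v4_normalized))  # prints 0.0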
