We will have to extend dist_raw to calculate the vector distance not on the raw vectors but on the normalized ones instead:
def dist_norm(v1, v2):
    v1_normalized = v1 / scipy.linalg.norm(v1.toarray())
    v2_normalized = v2 / scipy.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return scipy.linalg.norm(delta.toarray())
Executing best_post(X_train, new_post_vec, dist_norm) now yields the following similarity measurements:
=== Post 0 with dist=1.41: 'This is a toy post about machine learning. Actually, it contains not much interesting stuff.'
=== Post 1 with dist=0.86: 'Imaging databases provide storage capabilities.'
=== Post 2 with dist=0.92: 'Most imaging databases save images permanently. '
=== Post 3 with dist=0.77: 'Imaging databases store data.'
=== Post 4 with dist=0.77: 'Imaging databases store data. Imaging databases store data. Imaging databases store data.'
==> Best post is 3 with dist=0.77
This looks better now. Post 3 and Post 4 are calculated as being equally similar. One could argue whether that much repetition is a delight to the reader, but in terms of counting the words in the posts, this seems right.
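To see why normalization makes the repeated post and the original equally similar, consider a minimal sketch on dense NumPy arrays (the hypothetical dist_norm_dense below mirrors dist_norm, just without the sparse .toarray() calls):

```python
import numpy as np
import scipy.linalg

def dist_norm_dense(v1, v2):
    # Same idea as dist_norm, but on dense arrays for illustration:
    # scale each vector to unit length before taking the distance.
    v1_normalized = v1 / scipy.linalg.norm(v1)
    v2_normalized = v2 / scipy.linalg.norm(v2)
    return scipy.linalg.norm(v1_normalized - v2_normalized)

v = np.array([1.0, 2.0, 0.0])   # word counts of a post
v3 = 3 * v                      # the same post repeated three times

# Tripling every count does not change the direction of the vector,
# so after normalization the distance is (numerically) zero.
print(dist_norm_dense(v, v3))
```

Since 3*v points in the same direction as v, both normalize to the same unit vector, and the distance collapses to zero (up to floating-point rounding) regardless of how often the post is repeated.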