How not to do it

One text similarity measure is the Levenshtein distance, which also goes by the name edit distance. Let's say we have the two words "machine" and "mchiene". The similarity between them can be expressed as the minimum number of edits necessary to turn one word into the other. In this case, the edit distance is two, as we have to add an "a" after the "m" and delete the first "e". This algorithm is, however, quite costly, as its running time grows with the product of the lengths of the two words.
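As a rough sketch, here is how this calculation can be done in Python with dynamic programming, counting only insertions and deletions, just as the example above does (the classic Levenshtein distance also allows substituting one character for another; the function name edit_distance and the variable names here are our own choices, not from any library):

    def edit_distance(a, b):
        """Minimum number of insertions and deletions needed to turn
        sequence a into sequence b (works on strings as well as on
        lists of words)."""
        # prev[j] holds the distance between the prefix of a processed
        # so far and the first j items of b.
        prev = list(range(len(b) + 1))  # from the empty prefix: j insertions
        for i, item_a in enumerate(a, start=1):
            curr = [i]  # turning a[:i] into the empty sequence: i deletions
            for j, item_b in enumerate(b, start=1):
                if item_a == item_b:
                    curr.append(prev[j - 1])           # match, no edit needed
                else:
                    curr.append(min(prev[j] + 1,       # delete item_a
                                    curr[j - 1] + 1))  # insert item_b
            prev = curr
        return prev[-1]

    print(edit_distance("machine", "mchiene"))  # prints 2

The two nested loops are exactly why the running time grows with the product of the two lengths.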

Looking at our posts, we could cheat by treating whole words as characters and performing the edit distance calculation at the word level. Let's say we have two posts titled "how to format my hard disk" and "hard disk format problems" (and let us assume, for simplicity's sake, that each post consists only of its title). Counting insertions and deletions of words, we need an edit distance of six: we remove "how", "to", "format", and "my", and then add "format" and "problems" at the end. Thus, one could express the difference between two posts as the number of words that have to be added or deleted so that one text morphs into the other. Although working at the word level speeds up the overall approach quite a bit, the time complexity remains the same.
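With the edit_distance sketch from above, the word-level calculation is just a matter of splitting the titles first:

    post1 = "how to format my hard disk".split()
    post2 = "hard disk format problems".split()
    print(edit_distance(post1, post2))  # prints 6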

But even if it were fast enough, there is another problem. In the post above, the word "format" accounts for an edit distance of two, as it is deleted in one place and added back in another. So, our distance does not seem robust enough to take word reordering into account.
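A quick check with the sketch above makes this weakness explicit: two titles containing exactly the same words in a different order still end up with a nonzero distance.

    # same words, different order: arguably the distance should be zero
    print(edit_distance("hard disk format".split(),
                        "format hard disk".split()))  # prints 2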
