Similarity measures

There are many similarity measures that can be used for performing NLP tasks. The nltk.metrics package in NLTK provides a variety of evaluation and similarity measures that are useful for many NLP tasks.

In order to test the performance of taggers, chunkers, and so on, the standard scores known from information retrieval, such as precision, recall, and the F-measure, can be used.

Let's have a look at how the output of a named entity recognizer can be scored against a reference (training) file using these standard measures:

>>> from __future__ import print_function
>>> from nltk.metrics import *
>>> training='PERSON OTHER PERSON OTHER OTHER ORGANIZATION'.split()
>>> testing='PERSON OTHER OTHER OTHER OTHER OTHER'.split()
>>> print(accuracy(training,testing))
0.6666666666666666
>>> trainset=set(training)
>>> testset=set(testing)
>>> precision(trainset,testset)
1.0
>>> print(recall(trainset,testset))
0.6666666666666666
>>> print(f_measure(trainset,testset))
0.8
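
Note that converting the tag sequences into plain sets discards token positions and duplicate tags, which is why precision came out as 1.0 above even though one tag was wrong. One way to keep positional information, sketched here as an illustration, is to pair each tag with its index before building the sets:

>>> reference = set(enumerate(training))
>>> test = set(enumerate(testing))
>>> print(precision(reference, test))
0.6666666666666666
>>> print(recall(reference, test))
0.6666666666666666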

Applying similarity measures using the edit distance algorithm

Edit distance, or Levenshtein distance, between two strings is the minimum number of characters that must be inserted, substituted, or deleted in order to transform one string into the other.

The operations used by edit distance, together with the corresponding terms of the recurrence, are the following:

  • Copying a letter from the first string to the second string (cost 0) or substituting a letter with another (cost 1):

    D(i-1,j-1) + d(si,tj)   (substitution/copy)

  • Deleting a letter from the first string (cost 1):

    D(i-1,j) + 1   (deletion)

  • Inserting a letter into the second string (cost 1):

    D(i,j-1) + 1   (insertion)

The cell D(i,j) is the minimum of these three values:

    D(i,j) = min( D(i-1,j-1) + d(si,tj), D(i-1,j) + 1, D(i,j-1) + 1 )

The Python code for Edit Distance that is included in the nltk.metrics package is as follows:

from __future__ import print_function
def _edit_dist_init(len1, len2):
    lev = []
    for i in range(len1):
        lev.append([0] * len2)  # initialize 2D array to zero
    for i in range(len1):
        lev[i][0] = i           # column 0: 0,1,2,3,4,...
    for j in range(len2):
        lev[0][j] = j           # row 0: 0,1,2,3,4,...
    return lev

def _edit_dist_step(lev, i, j, s1, s2, transpositions=False):
    c1 = s1[i - 1]
    c2 = s2[j - 1]

    # skipping a character in s1
    a = lev[i - 1][j] + 1
    # skipping a character in s2
    b = lev[i][j - 1] + 1
    # substitution
    c = lev[i - 1][j - 1] + (c1 != c2)
    # transposition
    d = c + 1  # never picked by default
    if transpositions and i > 1 and j > 1:
        if s1[i - 2] == c2 and s2[j - 2] == c1:
            d = lev[i - 2][j - 2] + 1
    # pick the cheapest
    lev[i][j] = min(a, b, c, d)

def edit_distance(s1, s2, transpositions=False):
    # set up a 2-D array
    len1 = len(s1)
    len2 = len(s2)
    lev = _edit_dist_init(len1 + 1, len2 + 1)

    # iterate over the array
    for i in range(len1):
        for j in range(len2):
            _edit_dist_step(lev, i + 1, j + 1, s1, s2, transpositions=transpositions)
    return lev[len1][len2]

Let's have a look at the Edit Distance calculated in NLTK using the nltk.metrics package:

>>> import nltk
>>> from nltk.metrics import *
>>> edit_distance("relate","relation")
3
>>> edit_distance("suggestion","calculation")
7

Here, when we calculate the edit distance between relate and relation, three operations (one substitution and two insertions) are performed. While calculating the edit distance between suggestion and calculation, seven operations (six substitutions and one insertion) are performed.
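
The edit_distance() function also accepts the transpositions argument shown in the code above; when it is set to True, swapping two adjacent characters counts as a single operation (the Damerau-Levenshtein variant). For example:

>>> edit_distance("abc", "acb")
2
>>> edit_distance("abc", "acb", transpositions=True)
1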

Applying similarity measures using Jaccard's Coefficient

Jaccard's coefficient, or Tanimoto coefficient, may be defined as a measure of the overlap of two sets, X and Y.

It may be defined as follows:

  • Jaccard(X,Y) = |X∩Y| / |X∪Y|
  • Jaccard(X,X) = 1
  • Jaccard(X,Y) = 0 if X∩Y = ∅

The code for Jaccard's similarity may be given as follows:

def jacc_similarity(query, document):
    # ratio of shared items to all distinct items across the two sets
    first = set(query).intersection(set(document))
    second = set(query).union(set(document))
    # float() keeps the division fractional under Python 2 as well
    return len(first) / float(len(second))
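
As a quick check of the helper defined above, we can apply it to two small sets; the value it returns is the Jaccard coefficient itself, which is the complement of the Jaccard distance computed in the next example:

>>> X = set([10, 20, 30, 40])
>>> Y = set([20, 30, 60])
>>> print(jacc_similarity(X, Y))
0.4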

Let's have a look at the Jaccard distance available in NLTK, which is defined as 1 minus the Jaccard coefficient:

>>> import nltk
>>> from nltk.metrics import *
>>> X=set([10,20,30,40])
>>> Y=set([20,30,60])
>>> print(jaccard_distance(X,Y))
0.6

Applying similarity measures using the Smith Waterman distance

The Smith Waterman distance is similar to edit distance. This metric was originally developed to detect optimal alignments between related DNA and protein sequences. It uses a substitution cost function d that maps pairs of alphabet symbols to cost values, and a gap cost G that is charged for an insertion or deletion:

    D(i,j) = max( 0,                        // start over
                  D(i-1,j-1) - d(si,tj),    // substitution/copy
                  D(i-1,j) - G,             // insertion
                  D(i,j-1) - G )            // deletion

    Note

    The distance is the maximum value of D(i,j) over all i, j in the table.

  Example cost values:

  • G = 1        // gap cost
  • d(c,c) = -2  // substitution cost for matching characters
  • d(c,d) = +1  // substitution cost for differing characters

Similar to edit distance, Python code for the Smith Waterman distance can be used alongside the nltk.metrics package to compute string similarity with Smith Waterman in NLTK.
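
NLTK's metrics package does not include a ready-made Smith Waterman function, so the following is only a minimal sketch that follows the recurrence above, using the example costs G = 1, d(c,c) = -2, and d(c,d) = +1; the function name smith_waterman_distance is our own:

def smith_waterman_distance(s1, s2, gap_cost=1, match_cost=-2, mismatch_cost=1):
    # the table is initialized to zero; a cell never drops below zero ("start over")
    rows, cols = len(s1) + 1, len(s2) + 1
    D = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # substitution/copy cost depends on whether the characters match
            d = match_cost if s1[i - 1] == s2[j - 1] else mismatch_cost
            D[i][j] = max(0,
                          D[i - 1][j - 1] - d,     # substitution / copy
                          D[i - 1][j] - gap_cost,  # insertion (gap)
                          D[i][j - 1] - gap_cost)  # deletion (gap)
            best = max(best, D[i][j])
    # the distance is the maximum over all cells of the table
    return best

With these example costs, every matched character raises the score by 2, so the result for "relate" and "relation" is driven by their shared prefix:

>>> smith_waterman_distance("relate", "relation")
10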

Other string similarity metrics

Binary distance is a simple string distance metric: it returns 0.0 if the two labels are identical and 1.0 otherwise.

The Python code for Binary distance metrics is:

def binary_distance(label1, label2):
    return 0.0 if label1 == label2 else 1.0

Let's see how Binary distance metrics is implemented in NLTK:

>>> import nltk
>>> from nltk.metrics import *
>>> X = set([10,20,30,40])
>>> Y= set([30,50,70])
>>> binary_distance(X, Y)
1.0
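
The example above only exercises the non-identical case; passing the same set twice returns 0.0:

>>> binary_distance(X, X)
0.0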

The MASI distance is based on partial agreement when multiple labels are present; it weights the Jaccard distance according to how much one label set overlaps with, or is contained in, the other.

The Python code included in nltk.metrics for the MASI distance is as follows:

def masi_distance(label1, label2):
    len_intersection = len(label1.intersection(label2))
    len_union = len(label1.union(label2))
    len_label1 = len(label1)
    len_label2 = len(label2)
    if len_label1 == len_label2 and len_label1 == len_intersection:
        m = 1
    elif len_intersection == min(len_label1, len_label2):
        m = 0.67
    elif len_intersection > 0:
        m = 0.33
    else:
        m = 0

    return 1 - (len_intersection / float(len_union)) * m

Let's see the implementation of the MASI distance in NLTK:

>>> import nltk
>>> from __future__ import print_function
>>> from nltk.metrics import *
>>> X = set([10,20,30,40])
>>> Y= set([30,50,70])
>>> print(masi_distance(X,Y))
0.945