There are many similarity measures that can be used for performing NLP tasks. The nltk.metrics
package in NLTK provides a variety of evaluation and similarity measures, which are useful for a wide range of NLP tasks.
In order to test the performance of taggers, chunkers, and so on, the standard scores from information retrieval can be used.
Let's have a look at how the output of a named entity recognizer can be analyzed using the standard scores obtained from a training file:
>>> from __future__ import print_function
>>> from nltk.metrics import *
>>> training = 'PERSON OTHER PERSON OTHER OTHER ORGANIZATION'.split()
>>> testing = 'PERSON OTHER OTHER OTHER OTHER OTHER'.split()
>>> print(accuracy(training,testing))
0.6666666666666666
>>> trainset = set(training)
>>> testset = set(testing)
>>> precision(trainset,testset)
1.0
>>> print(recall(trainset,testset))
0.6666666666666666
>>> print(f_measure(trainset,testset))
0.8
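To make the set-based scores above concrete, here is a minimal, self-contained sketch that reproduces the same numbers. The function names (set_precision, set_recall, set_f_measure) are my own for illustration; they mirror what nltk.metrics computes on sets, with f_measure using the weighted harmonic mean:

```python
def set_precision(reference, test):
    # fraction of items in the test set that also appear in the reference set
    return len(reference & test) / len(test)

def set_recall(reference, test):
    # fraction of items in the reference set that were found in the test set
    return len(reference & test) / len(reference)

def set_f_measure(reference, test, alpha=0.5):
    # weighted harmonic mean of precision and recall
    # (assumes neither precision nor recall is zero)
    p = set_precision(reference, test)
    r = set_recall(reference, test)
    return 1.0 / (alpha / p + (1 - alpha) / r)

trainset = {'PERSON', 'OTHER', 'ORGANIZATION'}
testset = {'PERSON', 'OTHER'}
print(set_precision(trainset, testset))  # 1.0
print(set_recall(trainset, testset))     # 0.6666666666666666
print(set_f_measure(trainset, testset))  # 0.8
```

With alpha = 0.5, the f-measure reduces to the familiar 2pr/(p + r), which is why 1.0 precision and 2/3 recall combine to 0.8.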
Edit distance, or the Levenshtein edit distance, between two strings is the minimum number of characters that must be inserted, substituted, or deleted in order to transform one string into the other.
The operations performed in edit distance are captured by the following recurrence:

D(i, j) = min( D(i-1, j-1) + d(si, tj)   (substitution / copy),
               D(i, j-1) + 1             (deletion),
               D(i-1, j) + 1             (insertion) )

Here, d(si, tj) is 0 if the characters match (a copy) and 1 otherwise (a substitution).
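The recurrence can be transcribed directly as a recursive function before moving to the dynamic-programming version. The sketch below (the name lev_recursive is mine) is exponential-time and only meant to make the recurrence readable:

```python
def lev_recursive(s1, s2, i=None, j=None):
    # direct transcription of the edit distance recurrence;
    # exponential time, for illustration only
    if i is None:
        i, j = len(s1), len(s2)
    if i == 0:       # s1 exhausted: insert remaining characters of s2
        return j
    if j == 0:       # s2 exhausted: delete remaining characters of s1
        return i
    return min(
        lev_recursive(s1, s2, i - 1, j - 1) + (s1[i - 1] != s2[j - 1]),  # substitution / copy
        lev_recursive(s1, s2, i, j - 1) + 1,                             # deletion
        lev_recursive(s1, s2, i - 1, j) + 1,                             # insertion
    )

print(lev_recursive("relate", "relation"))  # 3
```

The dynamic-programming code that follows computes exactly the same values, but fills in a table bottom-up so each subproblem is solved only once.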
The Python code for Edit Distance that is included in the nltk.metrics
package is as follows:
from __future__ import print_function

def _edit_dist_init(len1, len2):
    lev = []
    for i in range(len1):
        lev.append([0] * len2)  # initialize 2D array to zero
    for i in range(len1):
        lev[i][0] = i           # column 0: 0, 1, 2, 3, 4, ...
    for j in range(len2):
        lev[0][j] = j           # row 0: 0, 1, 2, 3, 4, ...
    return lev

def _edit_dist_step(lev, i, j, s1, s2, transpositions=False):
    c1 = s1[i - 1]
    c2 = s2[j - 1]
    # skipping a character in s1
    a = lev[i - 1][j] + 1
    # skipping a character in s2
    b = lev[i][j - 1] + 1
    # substitution
    c = lev[i - 1][j - 1] + (c1 != c2)
    # transposition
    d = c + 1  # never picked by default
    if transpositions and i > 1 and j > 1:
        if s1[i - 2] == c2 and s2[j - 2] == c1:
            d = lev[i - 2][j - 2] + 1
    # pick the cheapest
    lev[i][j] = min(a, b, c, d)

def edit_distance(s1, s2, transpositions=False):
    # set up a 2D array
    len1 = len(s1)
    len2 = len(s2)
    lev = _edit_dist_init(len1 + 1, len2 + 1)
    # iterate over the array
    for i in range(len1):
        for j in range(len2):
            _edit_dist_step(lev, i + 1, j + 1, s1, s2,
                            transpositions=transpositions)
    return lev[len1][len2]
Let's have a look at the Edit Distance calculated in NLTK using the nltk.metrics
package:
>>> import nltk
>>> from nltk.metrics import *
>>> edit_distance("relate","relation")
3
>>> edit_distance("suggestion","calculation")
7
Here, when we calculate the edit distance between relate and relation, three operations (one substitution and two insertions) are performed. When calculating the edit distance between suggestion and calculation, seven operations (six substitutions and one insertion) are performed.
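The transpositions flag in the code above extends plain Levenshtein distance so that a swap of two adjacent characters counts as a single operation (the Damerau-Levenshtein variant). The self-contained sketch below (edit_distance_dl is my own name; the table-filling logic matches the nltk.metrics code shown earlier) illustrates the difference:

```python
def edit_distance_dl(s1, s2, transpositions=False):
    # DP over a (len1+1) x (len2+1) table; optionally allows
    # adjacent-character swaps (Damerau-Levenshtein)
    len1, len2 = len(s1), len(s2)
    lev = [[0] * (len2 + 1) for _ in range(len1 + 1)]
    for i in range(len1 + 1):
        lev[i][0] = i
    for j in range(len2 + 1):
        lev[0][j] = j
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            cost = int(s1[i - 1] != s2[j - 1])
            lev[i][j] = min(lev[i - 1][j] + 1,          # skip a character in s1
                            lev[i][j - 1] + 1,          # skip a character in s2
                            lev[i - 1][j - 1] + cost)   # substitution / copy
            if (transpositions and i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]):
                lev[i][j] = min(lev[i][j], lev[i - 2][j - 2] + 1)  # swap
    return lev[len1][len2]

# "abc" -> "acb": two substitutions normally, one swap with transpositions
print(edit_distance_dl("abc", "acb"))                       # 2
print(edit_distance_dl("abc", "acb", transpositions=True))  # 1
```

NLTK's edit_distance accepts the same transpositions keyword argument, so the calls above can be reproduced directly with the library function.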
Jaccard's coefficient, or Tanimoto coefficient, may be defined as a measure of the overlap of two sets, X and Y.
It may be defined as follows:

Jacc(X, Y) = |X ∩ Y| / |X ∪ Y|
The code for Jaccard's similarity may be given as follows:
def jacc_similarity(query, document):
    first = set(query).intersection(set(document))
    second = set(query).union(set(document))
    return len(first) / len(second)
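Note that jacc_similarity measures overlap, while NLTK's jaccard_distance measures dissimilarity; the two are complements of one another. A quick self-contained check, reusing the sets from the NLTK example:

```python
def jacc_similarity(query, document):
    # |intersection| / |union| of the two item collections
    first = set(query).intersection(set(document))
    second = set(query).union(set(document))
    return len(first) / len(second)

X = {10, 20, 30, 40}
Y = {20, 30, 60}
print(jacc_similarity(X, Y))      # 0.4  (2 shared items out of 5 in the union)
print(1 - jacc_similarity(X, Y))  # 0.6, matching nltk's jaccard_distance
```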
Let's have a look at the implementation of Jaccard's similarity coefficient using NLTK:
>>> import nltk
>>> from nltk.metrics import *
>>> X = set([10,20,30,40])
>>> Y = set([20,30,60])
>>> print(jaccard_distance(X,Y))
0.6
The Smith-Waterman distance is similar to edit distance. This similarity metric was developed to detect optimal local alignments between related protein and DNA sequences. It assigns a substitution cost, given by a function d that maps pairs of alphabet symbols to cost values, and a gap cost G for insertions and deletions:

H(i, j) = max( 0,
               H(i-1, j-1) + d(si, tj)   (substitution / copy),
               H(i-1, j) - G             (deletion),
               H(i, j-1) - G             (insertion) )
Similar to edit distance, Python code for Smith-Waterman can be written and used alongside the nltk.metrics
package to perform string similarity with Smith-Waterman in NLTK.
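NLTK does not ship a Smith-Waterman implementation, so the sketch below is a minimal, hypothetical one following the recurrence above: smith_waterman is my own name, and the match, mismatch, and gap parameters are illustrative defaults, not part of any library API. Scores are clamped at zero, which is what makes the alignment local:

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=-1):
    # H[i][j] holds the best score of any local alignment ending at
    # s1[i-1], s2[j-1]; negative-scoring prefixes are reset to 0
    rows, cols = len(s1) + 1, len(s2) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            H[i][j] = max(0,                 # restart the alignment here
                          diag,              # substitution / copy
                          H[i - 1][j] + gap, # gap in s2
                          H[i][j - 1] + gap) # gap in s1
            best = max(best, H[i][j])
    return best

print(smith_waterman("GGTTGACTA", "TGTTACGG"))
```

A higher score indicates a better local alignment; unlike edit distance, unrelated prefixes and suffixes do not penalize the result.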
Binary distance is a string similarity metric. It returns the value 0.0 if two labels are identical; otherwise, it returns the value 1.0.
The Python code for the binary distance metric is as follows:
def binary_distance(label1, label2):
    return 0.0 if label1 == label2 else 1.0
Let's see how the binary distance metric is implemented in NLTK:
>>> import nltk
>>> from nltk.metrics import *
>>> X = set([10,20,30,40])
>>> Y = set([30,50,70])
>>> binary_distance(X, Y)
1.0
Masi distance is based on partial agreement between sets when multiple labels are present.
The Python code included in nltk.metrics
for masi
distance is as follows:
def masi_distance(label1, label2):
    len_intersection = len(label1.intersection(label2))
    len_union = len(label1.union(label2))
    len_label1 = len(label1)
    len_label2 = len(label2)
    if len_label1 == len_label2 and len_label1 == len_intersection:
        m = 1
    elif len_intersection == min(len_label1, len_label2):
        m = 0.67
    elif len_intersection > 0:
        m = 0.33
    else:
        m = 0
    return 1 - (len_intersection / float(len_union)) * m
Let's see the implementation of masi
distance in NLTK:
>>> import nltk
>>> from __future__ import print_function
>>> from nltk.metrics import *
>>> X = set([10,20,30,40])
>>> Y = set([30,50,70])
>>> print(masi_distance(X,Y))
0.945