Disambiguation is the task of distinguishing between two or more words that share the same spelling or the same pronunciation, on the basis of their sense or meaning.
The following are implementations of disambiguation, or the WSD task, using Python technologies:
WordNet sense similarity in NLTK involves the following algorithms: path similarity, Leacock-Chodorow similarity, Wu-Palmer similarity, Resnik similarity, Lin similarity, and Jiang-Conrath similarity.
Consider the following example in NLTK, which depicts path similarity:
>>> from nltk.corpus import wordnet as wn
>>> lion = wn.synset('lion.n.01')
>>> cat = wn.synset('cat.n.01')
>>> lion.path_similarity(cat)
0.25
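Path similarity is computed as the reciprocal of the shortest hypernym-path length plus one, so the score of 0.25 above implies the two synsets are three edges apart. The following sketch illustrates just that arithmetic (the path length of 3 is inferred from the output above, not queried from WordNet):

```python
# Path similarity is 1 / (shortest_path_length + 1).
# A score of 0.25 corresponds to a shortest path of 3 edges
# between lion.n.01 and cat.n.01 in the hypernym hierarchy.
def path_similarity(shortest_path_length):
    return 1.0 / (shortest_path_length + 1)

print(path_similarity(3))  # 0.25
```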
Consider the following example in NLTK that depicts Leacock-Chodorow Similarity:
>>> from nltk.corpus import wordnet as wn
>>> lion = wn.synset('lion.n.01')
>>> cat = wn.synset('cat.n.01')
>>> lion.lch_similarity(cat)
2.2512917986064953
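Leacock-Chodorow similarity scales the shortest path length by the maximum depth of the taxonomy and takes a negative logarithm. Assuming the NLTK formulation, -log((edges + 1) / (2 * D)) with D = 19 for WordNet nouns, the value above can be reproduced from the same 3-edge path:

```python
import math

# Leacock-Chodorow: -log((edges + 1) / (2 * D)), where D is the
# maximum depth of the taxonomy (19 for WordNet nouns in NLTK).
# The 3-edge path is taken from the path-similarity result above.
def lch_similarity(shortest_path_edges, taxonomy_depth=19):
    return -math.log((shortest_path_edges + 1) / (2.0 * taxonomy_depth))

print(lch_similarity(3))  # 2.2512917986064953
```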
Consider the following example in NLTK that depicts Wu-Palmer Similarity:
>>> from nltk.corpus import wordnet as wn
>>> lion = wn.synset('lion.n.01')
>>> cat = wn.synset('cat.n.01')
>>> lion.wup_similarity(cat)
0.896551724137931
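Wu-Palmer similarity relates the depth of the two synsets' least common subsumer (LCS) to the depths of the synsets themselves: 2 * depth(LCS) / (depth(a) + depth(b)). The depths in this sketch are hypothetical, chosen only to reproduce the ratio 26/29 seen in the output above; the real values come from the WordNet hierarchy:

```python
# Wu-Palmer: 2 * depth(LCS) / (depth(a) + depth(b)).
# The depths below are illustrative placeholders (NOT queried from
# WordNet), picked so the ratio matches 26/29 from the REPL session.
def wup_similarity(lcs_depth, depth_a, depth_b):
    return 2.0 * lcs_depth / (depth_a + depth_b)

print(wup_similarity(13, 14, 15))  # 0.896551724137931
```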
Consider the following example in NLTK that depicts Resnik Similarity, Lin Similarity, and Jiang-Conrath Similarity:
>>> from nltk.corpus import wordnet as wn
>>> from nltk.corpus import wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')
>>> from nltk.corpus import genesis
>>> genesis_ic = wn.ic(genesis, False, 0.0)
>>> lion = wn.synset('lion.n.01')
>>> cat = wn.synset('cat.n.01')
>>> lion.res_similarity(cat, brown_ic)
8.663481537685325
>>> lion.res_similarity(cat, genesis_ic)
7.339696591781995
>>> lion.jcn_similarity(cat, brown_ic)
0.36425897775957294
>>> lion.jcn_similarity(cat, genesis_ic)
0.3057800856788946
>>> lion.lin_similarity(cat, semcor_ic)
0.8560734335071154
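All three of these measures are built on information content (IC), which is why the session loads IC files computed over corpora such as Brown and SemCor. The sketch below shows the underlying formulas with toy IC values (invented for illustration, not the real corpus-derived numbers):

```python
# Information-content (IC) based measures, shown with toy IC values
# (hypothetical, NOT the real Brown/SemCor figures):
#   Resnik:        IC(lcs)
#   Lin:           2 * IC(lcs) / (IC(a) + IC(b))
#   Jiang-Conrath: 1 / (IC(a) + IC(b) - 2 * IC(lcs))
def res_similarity(ic_lcs):
    return ic_lcs

def lin_similarity(ic_a, ic_b, ic_lcs):
    return 2.0 * ic_lcs / (ic_a + ic_b)

def jcn_similarity(ic_a, ic_b, ic_lcs):
    return 1.0 / (ic_a + ic_b - 2.0 * ic_lcs)

ic_a, ic_b, ic_lcs = 10.0, 9.0, 8.0   # hypothetical IC values
print(res_similarity(ic_lcs))              # 8.0
print(lin_similarity(ic_a, ic_b, ic_lcs)) # ~0.842
print(jcn_similarity(ic_a, ic_b, ic_lcs)) # ~0.333
```

More informative (rarer) least common subsumers yield higher Resnik scores, while Lin and Jiang-Conrath normalize by the information content of the two synsets being compared.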
Let's see the following code in NLTK based on Wu-Palmer Similarity and Path Distance Similarity:
from nltk.corpus import wordnet as wn

def getSenseSimilarity(worda, wordb):
    """
    Find the similarity between the word senses of two words.
    """
    wordasynsets = wn.synsets(worda)
    wordbsynsets = wn.synsets(wordb)
    synsetnamea = [wn.synset(syns.name()) for syns in wordasynsets]
    synsetnameb = [wn.synset(syns.name()) for syns in wordbsynsets]
    for sseta, ssetb in [(sseta, ssetb) for sseta in synsetnamea
                         for ssetb in synsetnameb]:
        pathsim = sseta.path_similarity(ssetb)
        wupsim = sseta.wup_similarity(ssetb)
        if pathsim is not None:
            print("Path Sim Score:", pathsim, "WUP Sim Score:", wupsim,
                  sseta.definition(), ssetb.definition())

if __name__ == "__main__":
    #getSenseSimilarity('walk','dog')
    getSenseSimilarity('cricket', 'ball')
Consider the following code of the Lesk algorithm in NLTK, which is used to perform the disambiguation task:
from nltk.corpus import wordnet

def lesk(context_sentence, ambiguous_word, pos=None, synsets=None):
    """Return a synset for an ambiguous word in a context.

    :param iter context_sentence: The context sentence where the ambiguous
        word occurs, passed as an iterable of words.
    :param str ambiguous_word: The ambiguous word that requires WSD.
    :param str pos: A specified Part-of-Speech (POS).
    :param iter synsets: Possible synsets of the ambiguous word.
    :return: ``lesk_sense`` The Synset() object with the highest signature
        overlaps.

    This function is an implementation of the original Lesk algorithm
    (1986) [1].

    Usage example::

        >>> lesk(['I', 'went', 'to', 'the', 'bank', 'to', 'deposit',
        ...       'money', '.'], 'bank', 'n')
        Synset('savings_bank.n.02')
    """
    context = set(context_sentence)
    if synsets is None:
        synsets = wordnet.synsets(ambiguous_word)

    if pos:
        synsets = [ss for ss in synsets if str(ss.pos()) == pos]

    if not synsets:
        return None

    _, sense = max(
        (len(context.intersection(ss.definition().split())), ss)
        for ss in synsets
    )

    return sense
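The overlap-counting core of Lesk can be exercised without WordNet by supplying sense glosses inline. The sense labels and glosses below are invented for illustration; only the overlap logic mirrors the function above:

```python
# A minimal, self-contained Lesk sketch: sense glosses are inlined
# (hypothetical, not from WordNet) so the overlap logic can be run
# without any corpus downloads.
def simple_lesk(context_sentence, senses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = set(context_sentence)
    _, best = max(
        (len(context.intersection(gloss.split())), name)
        for name, gloss in senses.items()
    )
    return best

senses = {  # made-up glosses for the ambiguous word 'bank'
    'bank.river': 'sloping land beside a body of water',
    'bank.finance': 'an institution where you deposit money',
}
sentence = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
print(simple_lesk(sentence, senses))  # bank.finance
```

The financial sense wins because its gloss shares two words ('deposit', 'money') with the context, while the river sense shares none.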