Disambiguating senses using WordNet

Disambiguation is the task of distinguishing between two or more words that share the same spelling or the same sound on the basis of their sense or meaning.
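
As a quick illustration of the problem, the following minimal sketch lists the WordNet senses that NLTK stores for the ambiguous word bank; choosing among these senses for a given context is exactly the WSD task:

from nltk.corpus import wordnet as wn

# Every synset is one candidate sense of 'bank' (the river bank,
# the financial institution, and so on); WSD must pick among them.
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())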

The following are implementations of disambiguation, or the WSD task, using Python technologies:

  • Lesk algorithms:
    • Original Lesk
    • Cosine Lesk (uses cosine similarity to calculate overlaps instead of raw counts)
    • Simple Lesk (with definitions, example(s), and hyper+hyponyms)
    • Adapted/extended Lesk
    • Enhanced Lesk
  • Maximizing similarity (sketched in code after the similarity measures below):
    • Information content
    • Path similarity
  • Supervised WSD:
    • It Makes Sense (IMS)
    • SVM WSD
  • Vector Space models:
    • Topic Models, LDA
    • LSI/LSA
    • NMF
  • Graph-based models:
    • Babelfy
    • UKB
  • Baselines (sketched in code after this list):
    • Random sense
    • Highest lemma counts
    • First NLTK sense
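
As a concrete starting point, the three baselines at the end of the preceding list can be sketched in a few lines of NLTK (the function names here are illustrative, not a fixed API):

import random
from nltk.corpus import wordnet as wn

def random_sense(word):
    # Baseline 1: pick a sense at random.
    return random.choice(wn.synsets(word))

def first_sense(word):
    # Baseline 2: pick the first sense NLTK returns (WordNet orders
    # senses by frequency, so this is a surprisingly strong baseline).
    return wn.synsets(word)[0]

def highest_lemma_count_sense(word):
    # Baseline 3: pick the sense whose lemmas have the highest
    # occurrence counts (SemCor frequencies stored in WordNet).
    return max(wn.synsets(word),
               key=lambda ss: sum(lemma.count() for lemma in ss.lemmas()))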

WordNet sense similarity in NLTK involves the following algorithms:

  • Resnik Score: On comparing two tokens, the score returned is the Information Content (IC) of the Least Common Subsumer, that is, the most specific ancestor node that subsumes both senses
  • Wu-Palmer Similarity: Defines the similarity between two tokens on the basis of the depths of the two senses in the taxonomy and the depth of their Least Common Subsumer
  • Path Distance Similarity: The similarity of two tokens is determined on the basis of the shortest path connecting the senses in the is-a (hypernym/hyponym) taxonomy
  • Leacock Chodorow Similarity: A similarity score is returned on the basis of the shortest path between the senses and the maximum depth of the taxonomy in which the senses occur
  • Lin Similarity: A similarity score is returned on the basis of the Information Content of the Least Common Subsumer and of the two input Synsets
  • Jiang-Conrath Similarity: A similarity score is returned on the basis of a distance computed from the Information Content of the Least Common Subsumer and of the two input Synsets
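
The maximizing-similarity strategy from the earlier list can be combined with any of these measures. The following is a minimal sketch, assuming path similarity as the measure (the function max_sim_wsd is an illustrative name, not a fixed API): each candidate sense of the ambiguous word is scored by how similar it is to the senses of the surrounding words, and the highest-scoring sense wins:

from nltk.corpus import wordnet as wn

def max_sim_wsd(context_words, ambiguous_word):
    # Score each candidate sense by its best path similarity to
    # every context word, and return the highest-scoring sense.
    best_score, best_sense = 0.0, None
    for candidate in wn.synsets(ambiguous_word):
        score = 0.0
        for word in context_words:
            # path_similarity() can return None (for example, across
            # parts of speech); treat that as zero similarity.
            sims = [candidate.path_similarity(ss) or 0.0
                    for ss in wn.synsets(word)]
            if sims:
                score += max(sims)
        if score > best_score:
            best_score, best_sense = score, candidate
    return best_sense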

Consider the following example in NLTK, which depicts path similarity:

>>> from nltk.corpus import wordnet as wn
>>> lion = wn.synset('lion.n.01')
>>> cat = wn.synset('cat.n.01')
>>> lion.path_similarity(cat)
0.25
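
Path Distance Similarity is computed as 1/(shortest_path_length + 1) and therefore always falls between 0 and 1; the score of 0.25 indicates that lion.n.01 and cat.n.01 are three edges apart in the is-a taxonomy.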

Consider the following example in NLTK that depicts Leacock Chodorow Similarity:

>>> from nltk.corpus import wordnet as wn
>>> lion = wn.synset('lion.n.01')
>>> cat = wn.synset('cat.n.01')
>>> lion.lch_similarity(cat)
2.2512917986064953
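
Leacock Chodorow Similarity is computed as -log((d + 1) / (2 * D)), where d is the shortest path length between the two senses and D is the maximum depth of the taxonomy for that part of speech; unlike path similarity, its scores are not bounded by 1.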

Consider the following example in NLTK that depicts Wu-Palmer Similarity:

>>> from nltk.corpus import wordnet as wn
>>> lion = wn.synset('lion.n.01')
>>> cat = wn.synset('cat.n.01')
>>> lion.wup_similarity(cat)
0.896551724137931
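
Wu-Palmer Similarity follows the formula 2 * depth(LCS) / (depth(sense1) + depth(sense2)), so the score approaches 1 when the two senses share a Least Common Subsumer that lies deep in the taxonomy, as lion and cat do here.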

Consider the following example in NLTK that depicts Resnik Similarity, Lin Similarity, and Jiang-Conrath Similarity:

>>> from nltk.corpus import wordnet as wn
>>> from nltk.corpus import wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')
>>> from nltk.corpus import genesis
>>> genesis_ic = wn.ic(genesis, False, 0.0)
>>> lion = wn.synset('lion.n.01')
>>> cat = wn.synset('cat.n.01')
>>> lion.res_similarity(cat, brown_ic)
8.663481537685325
>>> lion.res_similarity(cat, genesis_ic)
7.339696591781995
>>> lion.jcn_similarity(cat, brown_ic)
0.36425897775957294
>>> lion.jcn_similarity(cat, genesis_ic)
0.3057800856788946
>>> lion.lin_similarity(cat, semcor_ic)
0.8560734335071154
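
Note that the Resnik, Jiang-Conrath, and Lin measures all require an information content (IC) dictionary: ic-brown.dat and ic-semcor.dat are precomputed from the Brown corpus and SemCor, while wn.ic() builds one from any corpus (here, Genesis). This is why the same pair of synsets receives different scores under different IC files.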

Let's look at the following NLTK code, which compares the senses of two words using Wu-Palmer Similarity and Path Distance Similarity:

from nltk.corpus import wordnet as wn

def getSenseSimilarity(worda, wordb):
    """
    Find the similarity between the word senses of two words.
    """
    wordasynsets = wn.synsets(worda)
    wordbsynsets = wn.synsets(wordb)
    # Compare every sense of worda against every sense of wordb.
    for sseta in wordasynsets:
        for ssetb in wordbsynsets:
            pathsim = sseta.path_similarity(ssetb)
            wupsim = sseta.wup_similarity(ssetb)
            if pathsim is not None:
                # Print both scores along with the two sense definitions.
                print("Path Sim Score:", pathsim, "WUP Sim Score:", wupsim,
                      "\t", sseta.definition(), "\t", ssetb.definition())

if __name__ == "__main__":
    # getSenseSimilarity('walk', 'dog')
    getSenseSimilarity('cricket', 'ball')
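
For 'cricket' and 'ball', this prints one line for every pair of senses that has a defined path similarity, which makes it easy to inspect which pairing of senses WordNet places closest in the is-a taxonomy.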

Consider the following code for the Lesk algorithm in NLTK, which is used to perform the disambiguation task:

from nltk.corpus import wordnet


def lesk(context_sentence, ambiguous_word, pos=None, synsets=None):
    """Return a synset for an ambiguous word in a context.

    :param iter context_sentence: The context sentence where the ambiguous word
    occurs, passed as an iterable of words.
    :param str ambiguous_word: The ambiguous word that requires WSD.
    :param str pos: A specified Part-of-Speech (POS).
    :param iter synsets: Possible synsets of the ambiguous word.
    :return: ``lesk_sense`` The Synset() object with the highest signature overlaps.

//    This function is an implementation of the original Lesk algorithm (1986) [1].

    Usage example::

>>> lesk(['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.'], 'bank', 'n')
        Synset('savings_bank.n.02')


    context = set(context_sentence)
    if synsets is None:
        synsets = wordnet.synsets(ambiguous_word)

    if pos:
        synsets = [ss for ss in synsets if str(ss.pos()) == pos]

    if not synsets:
        return None

    _, sense = max(
        (len(context.intersection(ss.definition().split())), ss) for ss in synsets
    )

    return sense
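
This implementation mirrors the lesk() function that ships with NLTK in the nltk.wsd module, so you can also call it directly:

>>> from nltk.wsd import lesk
>>> sentence = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
>>> lesk(sentence, 'bank', 'n')
Synset('savings_bank.n.02')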