Chapter 6. Semantic Analysis – Meaning Matters

Semantic analysis, or meaning generation, is one of the tasks in NLP. It is defined as the process of determining the meaning of character or word sequences, and it may be used to perform the task of disambiguation.

This chapter will include the following topics:

  • NER
  • NER system using the HMM
  • Training NER using machine learning toolkits
  • NER using POS tagging
  • Generation of the synset id from WordNet
  • Disambiguating senses using WordNet

Introducing semantic analysis

NLP means performing computations on natural language, and semantic analysis is one of the steps performed while processing it. Once the syntactic structure of an input sentence has been built, the semantic analysis of the sentence can be carried out. Semantic interpretation means mapping a meaning to a sentence; contextual interpretation maps the logical form to a knowledge representation. The primitive or basic unit of semantic analysis is referred to as the meaning or sense. One of the early tools dealing with meaning is ELIZA, developed in the sixties by Joseph Weizenbaum. It made use of substitution and pattern-matching techniques to analyze an input sentence and produce a response. MARGIE, developed by Roger Schank in the seventies, could represent all English verbs using 11 primitives; it could interpret the sense of a sentence and represent it with the help of these primitives, and it further gave way to the concept of scripts. From MARGIE, the Script Applier Mechanism (SAM) was developed, which could translate sentences across different languages, such as English, Chinese, Russian, Dutch, and Spanish. To perform processing on textual data, the Python library TextBlob can be used. TextBlob provides APIs for performing NLP tasks such as part-of-speech tagging, noun phrase extraction, classification, machine translation, and sentiment analysis.
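
Here is a minimal sketch of the TextBlob API (assuming the textblob package and its corpora are installed; the sample sentence is illustrative):

>>> from textblob import TextBlob
>>> blob = TextBlob("John went to Greece and enjoyed the trip.")
>>> blob.tags            # list of (word, POS tag) pairs
>>> blob.noun_phrases    # extracted noun phrases
>>> blob.sentiment       # polarity and subjectivity scores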

Semantic analysis can be used to query a database and retrieve information. Another Python library, Gensim, can be used to perform document indexing, topic modeling, and similarity retrieval. Polyglot is an NLP tool that supports various multilingual applications. It provides NER for 40 different languages, tokenization for 165 different languages, language detection for 196 different languages, sentiment analysis for 136 different languages, POS tagging for 16 different languages, Morphological Analysis for 135 different languages, word embedding for 137 different languages, and transliteration for 69 different languages. MontyLingua is an NLP tool that is used to perform the semantic interpretation of English text. From English sentences, it extracts semantic information, such as verbs, nouns, adjectives, dates, phrases, and so on.

Sentences can be formally represented using logic. The basic expressions or sentences in propositional logic are represented using propositional symbols, such as P, Q, R, and so on. Complex expressions in propositional logic can be represented using Boolean operators. For example, the sentence If it is raining, I'll wear a raincoat can be represented in propositional logic as follows:

  • P: It is raining.
  • Q: I'll wear a raincoat.
  • P→Q: If it is raining, I'll wear a raincoat.

Consider the following code, which displays the Boolean operators used in NLTK:

>>> import nltk
>>> nltk.boolean_ops()
negation	-
conjunction	&
disjunction	|
implication	->
equivalence	<->

Well-formed formulas (WFFs) are formed using propositional symbols alone, or using a combination of propositional symbols and Boolean operators.

Let's see the following code in NLTK, which categorizes logical expressions into different subclasses:

>>> import nltk
>>> input_expr = nltk.sem.Expression.fromstring
>>> input_expr('X | (Y -> Z)')
<OrExpression (X | (Y -> Z))>
>>> input_expr('-(X & Y)')
<NegatedExpression -(X & Y)>
>>> input_expr('X & Y')
<AndExpression (X & Y)>
>>> input_expr('X <-> -- X')
<IffExpression (X <-> --X)>

For mapping True or False values to logical expressions, the Valuation function is used in NLTK:

>>> import nltk
>>> value = nltk.Valuation([('X', True), ('Y', False), ('Z', True)])
>>> value['Z']
True
>>> domain = set()
>>> v = nltk.Assignment(domain)
>>> u = nltk.Model(domain, value)
>>> print(u.evaluate('(X & Y)', v))
False
>>> print(u.evaluate('-(X & Y)', v))
True
>>> print(u.evaluate('(X & Z)', v))
True
>>> print(u.evaluate('(X | Y)', v))
True

First-order predicate logic involving constants and predicates is depicted in the following NLTK code:

>>> import nltk
>>> input_expr = nltk.sem.Expression.fromstring
>>> expression = input_expr('run(marcus)', type_check=True)
>>> expression.argument
<ConstantExpression marcus>
>>> expression.argument.type
e
>>> expression.function
<ConstantExpression run>
>>> expression.function.type
<e,?>
>>> sign = {'run': '<e, t>'}
>>> expression = input_expr('run(marcus)', signature=sign)
>>> expression.function.type
e

In NLTK, a signature is used to map non-logical constants to their associated types. Now, consider the following code in NLTK, which helps to generate a query and retrieve data from a database:

>>> import nltk
>>> nltk.data.show_cfg('grammars/book_grammars/sql1.fcfg')
% start S
S[SEM=(?np + WHERE + ?vp)] -> NP[SEM=?np] VP[SEM=?vp]
VP[SEM=(?v + ?pp)] -> IV[SEM=?v] PP[SEM=?pp]
VP[SEM=(?v + ?ap)] -> IV[SEM=?v] AP[SEM=?ap]
VP[SEM=(?v + ?np)] -> TV[SEM=?v] NP[SEM=?np]
VP[SEM=(?vp1 + ?c + ?vp2)] -> VP[SEM=?vp1] Conj[SEM=?c] VP[SEM=?vp2]
NP[SEM=(?det + ?n)] -> Det[SEM=?det] N[SEM=?n]
NP[SEM=(?n + ?pp)] -> N[SEM=?n] PP[SEM=?pp]
NP[SEM=?n] -> N[SEM=?n] | CardN[SEM=?n]
CardN[SEM='1000'] -> '1,000,000'
PP[SEM=(?p + ?np)] -> P[SEM=?p] NP[SEM=?np]
AP[SEM=?pp] -> A[SEM=?a] PP[SEM=?pp]
NP[SEM='Country="greece"'] -> 'Greece'
NP[SEM='Country="china"'] -> 'China'
Det[SEM='SELECT'] -> 'Which' | 'What'
Conj[SEM='AND'] -> 'and'
N[SEM='City FROM city_table'] -> 'cities'
N[SEM='Population'] -> 'populations'
IV[SEM=''] -> 'are'
TV[SEM=''] -> 'have'
A -> 'located'
P[SEM=''] -> 'in'
P[SEM='>'] -> 'above'
>>> from nltk import load_parser
>>> test = load_parser('grammars/book_grammars/sql1.fcfg')
>>> q=" What cities are in Greece"
>>> t = list(test.parse(q.split()))
>>> ans = t[0].label()['SEM']
>>> ans = [s for s in ans if s]
>>> q = ' '.join(ans)
>>> print(q)
SELECT City FROM city_table WHERE Country="greece"
>>> from nltk.sem import chat80
>>> r = chat80.sql_query('corpora/city_database/city.db', q)
>>> for p in r:
	print(p[0], end=" ")

athens

Introducing NER

Named entity recognition (NER) is the process in which proper nouns, or named entities, are located in a document. These named entities are then classified into different categories, such as name of person, location, organization, and so on.

There are 12 NER tags defined by the IIIT-Hyderabad IJCNLP 2008 NER shared task. These are described here:

SNo.   Named entity tag   Meaning
1      NEP                Name of Person
2      NED                Name of Designation
3      NEO                Name of Organization
4      NEA                Name of Abbreviation
5      NEB                Name of Brand
6      NETP               Title of Person
7      NETO               Title of Object
8      NEL                Name of Location
9      NETI               Time
10     NEN                Number
11     NEM                Measure
12     NETE               Terms

One of the applications of NER is information extraction. In NLTK, we can perform the task of information extraction by storing tuples of the form (entity, relation, entity); entity values can then be retrieved by querying these tuples.

Consider an example in NLTK that shows how information extraction is performed:

>>> import nltk
>>> locations=[('Jaipur', 'IN', 'Rajasthan'),('Ajmer', 'IN', 'Rajasthan'),('Udaipur', 'IN', 'Rajasthan'),('Mumbai', 'IN', 'Maharashtra'),('Ahmedabad', 'IN', 'Gujrat')]
>>> q = [x1 for (x1, relation, x2) in locations if x2=='Rajasthan']
>>> print(q)
['Jaipur', 'Ajmer', 'Udaipur']

The nltk.tag.stanford module, which makes use of the Stanford taggers, can also be used to perform NER. The tagger models can be downloaded from http://nlp.stanford.edu/software.

Let's see the following example in NLTK that can be used to perform NER using the Stanford tagger:

>>> from nltk.tag import StanfordNERTagger
>>> # the Stanford NER jar must be supplied via path_to_jar or the CLASSPATH
>>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
>>> st.tag('John goes to NY'.split())
[('John', 'PERSON'), ('goes', 'O'), ('to', 'O'), ('NY', 'LOCATION')]

A classifier has been trained in NLTK to detect named entities. Using the function nltk.ne_chunk(), named entities can be identified from a text. If the parameter binary is set to True, then the named entities are detected and tagged with the NE tag; otherwise, the named entities are tagged with category labels such as PERSON, GPE, and ORGANIZATION.

Let's see the following code, which detects named entities, if they exist, and tags them with the NE tag:

>>> import nltk
>>> sentences1 = nltk.corpus.treebank.tagged_sents()[17]
>>> print(nltk.ne_chunk(sentences1, binary=True))
(S
  The/DT
  total/NN
  of/IN
  18/CD
  deaths/NNS
  from/IN
  malignant/JJ
  mesothelioma/NN
  ,/,
  lung/NN
  cancer/NN
  and/CC
  asbestosis/NN
  was/VBD
  far/RB
  higher/JJR
  than/IN
  */-NONE-
  expected/VBN
  *?*/-NONE-
  ,/,
  the/DT
  researchers/NNS
  said/VBD
  0/-NONE-
  *T*-1/-NONE-
  ./.)
>>> sentences2 = nltk.corpus.treebank.tagged_sents()[7]
>>> print(nltk.ne_chunk(sentences2, binary=True))
(S
  A/DT
  (NE Lorillard/NNP)
  spokewoman/NN
  said/VBD
  ,/,
  ``/``
  This/DT
  is/VBZ
  an/DT
  old/JJ
  story/NN
  ./.)
>>> print(nltk.ne_chunk(sentences2))
(S
  A/DT
  (ORGANIZATION Lorillard/NNP)
  spokewoman/NN
  said/VBD
  ,/,
  ``/``
  This/DT
  is/VBZ
  an/DT
  old/JJ
  story/NN
  ./.)

Consider another example in NLTK that can be used to detect named entities:

>>> import nltk
>>> from nltk.corpus import conll2002
>>> for documents in conll2002.chunked_sents('ned.train')[25]:
	print(documents)


(PER Vandenbussche/Adj)
('zelf', 'Pron')
('besloot', 'V')
('dat', 'Conj')
('het', 'Art')
('hof', 'N')
('"', 'Punc')
('de', 'Art')
('politieke', 'Adj')
('zeden', 'N')
('uit', 'Prep')
('het', 'Art')
('verleden', 'N')
('"', 'Punc')
('heeft', 'V')
('willen', 'V')
('veroordelen', 'V')
('.', 'Punc')

A chunker is a program that partitions plain text into sequences of syntactically related words, called chunks. To perform NER in NLTK, default chunkers are used. The default chunkers are classifier-based chunkers that have been trained on the ACE corpus. Other chunkers have been trained on parsed or chunked NLTK corpora. The languages covered by these NLTK chunkers are as follows:

  • Dutch
  • Spanish
  • Portuguese
  • English

Consider another example in NLTK that identifies named entities and categorizes them into different named entity classes:

>>> import nltk
>>> sentence = "I went to Greece to meet John";
>>> tok=nltk.word_tokenize(sentence)
>>> pos_tag=nltk.pos_tag(tok)
>>> print(nltk.ne_chunk(pos_tag))
(S
  I/PRP
  went/VBD
  to/TO
  (GPE Greece/NNP)
  to/TO
  meet/VB
  (PERSON John/NNP))

An NER system using the Hidden Markov Model

The HMM is one of the popular statistical approaches to NER. An HMM is defined as a Stochastic Finite State Automaton (SFSA) consisting of a finite set of states, each associated with a probability distribution; the states themselves are unobserved, or hidden. An HMM generates an optimal state sequence as output. The HMM is based on the Markov chain property, according to which the probability of the occurrence of the next state depends only on the previous state. It is the simplest approach to implement. The drawbacks of the HMM are that it requires a large amount of training data and that it cannot capture long-range dependencies. An HMM consists of the following:

  • A set of states, S, where |S| = N. Here, N is the total number of states.
  • A start state, S0.
  • An output alphabet, O, where |O| = k. Here, k is the total number of output symbols.
  • Transition probabilities, A.
  • Emission probabilities, B.
  • Initial state probabilities, π.

An HMM is represented by the tuple λ = (A, B, π).

The start probability or initial state probability may be defined as the probability that a particular tag occurs first in a sentence.

The transition probability, A = {aij}, may be defined as the probability of the occurrence of the next tag j in a sentence, given the occurrence of the particular tag i at present. It is estimated as follows:

aij = (number of transitions from state si to state sj) / (number of transitions out of state si)

The emission probability, B = {bj(k)}, may be defined as the probability of the occurrence of an output symbol k, given the state j. It is estimated as follows:

bj(k) = (number of times in state j observing the symbol k) / (expected number of times in state j)
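
The following short sketch makes these ratio-of-counts estimates concrete (the toy tagged corpus, the tag labels, and the two helper functions are illustrative only, not part of NLTK):

from collections import Counter

# A toy tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [('John', 'NEP'), ('visited', 'O'), ('Greece', 'NEL')],
    [('Smith', 'NEP'), ('lives', 'O'), ('in', 'O'), ('Jaipur', 'NEL')],
]

transitions = Counter()   # counts of tag i followed by tag j
emissions = Counter()     # counts of tag j emitting word k
tag_counts = Counter()    # total occurrences of each tag

for sent in corpus:
    tags = [tag for _, tag in sent]
    for word, tag in sent:
        emissions[(tag, word)] += 1
        tag_counts[tag] += 1
    for i, j in zip(tags, tags[1:]):
        transitions[(i, j)] += 1

def transition_prob(i, j):
    # a_ij: relative frequency of moving from tag i to tag j
    total = sum(c for (s, _), c in transitions.items() if s == i)
    return transitions[(i, j)] / total if total else 0.0

def emission_prob(j, k):
    # b_j(k): relative frequency of tag j emitting word k
    return emissions[(j, k)] / tag_counts[j] if tag_counts[j] else 0.0

print(transition_prob('NEP', 'O'))     # 1.0 in this toy corpus
print(emission_prob('NEL', 'Greece'))  # 0.5 in this toy corpus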

The Baum-Welch algorithm is used to find the maximum likelihood and posterior mode estimates of the HMM parameters. The forward-backward algorithm is used to find the posterior marginals of all the hidden state variables, given a sequence of emissions or observations.

There are three steps involved in performing NER using an HMM: annotation, HMM training, and HMM testing. The annotation module converts raw text into annotated or trainable data. During HMM training, we compute the HMM parameters: the start probabilities, transition probabilities, and emission probabilities. During HMM testing, the Viterbi algorithm is used to find the optimal tag sequence.
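
NLTK bundles a supervised HMM trainer that carries out these training and testing steps; the following is a minimal sketch (the slice of 3,000 treebank sentences is an arbitrary choice, and the treebank corpus must have been downloaded):

import nltk
from nltk.corpus import treebank

# HMM train: estimate the start, transition, and emission
# probabilities from tagged sentences.
train_data = treebank.tagged_sents()[:3000]
trainer = nltk.tag.hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(train_data)

# HMM test: tag() applies the Viterbi algorithm to find the
# optimal tag sequence for a new sentence.
print(hmm_tagger.tag('John went to NY'.split()))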

Consider the following demo of HMM-based POS tagging in NLTK. The resulting POS tags can be used for chunking, through which NP and VP chunks are obtained; NP chunks can be further processed to obtain proper nouns, or named entities:

>>> import nltk
>>> nltk.tag.hmm.demo_pos()

HMM POS tagging demo

Training HMM...
Testing...
Test: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Untagged: the fulton county grand jury said friday an investigation of atlanta's recent primary election produced `` no evidence '' that any irregularities took place .

HMM-tagged: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

Entropy: 18.7331739705

------------------------------------------------------------
Test: the/AT jury/NN further/RBR said/VBD in/IN term-end/NN presentments/NNS that/CS the/AT city/NN executive/JJ committee/NN ,/, which/WDT had/HVD over-all/JJ charge/NN of/IN the/AT election/NN ,/, ``/`` deserves/VBZ the/AT praise/NN and/CC thanks/NNS of/IN the/AT city/NN of/IN atlanta/NP ''/'' for/IN the/AT manner/NN in/IN which/WDT the/AT election/NN was/BEDZ conducted/VBN ./.

Untagged: the jury further said in term-end presentments that the city executive committee , which had over-all charge of the election , `` deserves the praise and thanks of the city of atlanta '' for the manner in which the election was conducted .

HMM-tagged: the/AT jury/NN further/RBR said/VBD in/IN term-end/AT presentments/NN that/CS the/AT city/NN executive/NN committee/NN ,/, which/WDT had/HVD over-all/VBN charge/NN of/IN the/AT election/NN ,/, ``/`` deserves/VBZ the/AT praise/NN and/CC thanks/NNS of/IN the/AT city/NN of/IN atlanta/NP ''/'' for/IN the/AT manner/NN in/IN which/WDT the/AT election/NN was/BEDZ conducted/VBN ./.

Entropy: 27.0708725519

------------------------------------------------------------
Test: the/AT september-october/NP term/NN jury/NN had/HVD been/BEN charged/VBN by/IN fulton/NP superior/JJ court/NN judge/NN durwood/NP pye/NP to/TO investigate/VB reports/NNS of/IN possible/JJ ``/`` irregularities/NNS ''/'' in/IN the/AT hard-fought/JJ primary/NN which/WDT was/BEDZ won/VBN by/IN mayor-nominate/NN ivan/NP allen/NP jr./NP ./.

Untagged: the september-october term jury had been charged by fulton superior court judge durwood pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by mayor-nominate ivan allen jr. .

HMM-tagged: the/AT september-october/JJ term/NN jury/NN had/HVD been/BEN charged/VBN by/IN fulton/NP superior/JJ court/NN judge/NN durwood/TO pye/VB to/TO investigate/VB reports/NNS of/IN possible/JJ ``/`` irregularities/NNS ''/'' in/IN the/AT hard-fought/JJ primary/NN which/WDT was/BEDZ won/VBN by/IN mayor-nominate/NP ivan/NP allen/NP jr./NP ./.

Entropy: 33.8281874237

------------------------------------------------------------
Test: ``/`` only/RB a/AT relative/JJ handful/NN of/IN such/JJ reports/NNS was/BEDZ received/VBN ''/'' ,/, the/AT jury/NN said/VBD ,/, ``/`` considering/IN the/AT widespread/JJ interest/NN in/IN the/AT election/NN ,/, the/AT number/NN of/IN voters/NNS and/CC the/AT size/NN of/IN this/DT city/NN ''/'' ./.

Untagged: `` only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .

HMM-tagged: ``/`` only/RB a/AT relative/JJ handful/NN of/IN such/JJ reports/NNS was/BEDZ received/VBN ''/'' ,/, the/AT jury/NN said/VBD ,/, ``/`` considering/IN the/AT widespread/JJ interest/NN in/IN the/AT election/NN ,/, the/AT number/NN of/IN voters/NNS and/CC the/AT size/NN of/IN this/DT city/NN ''/'' ./.

Entropy: 11.4378198596

------------------------------------------------------------
Test: the/AT jury/NN said/VBD it/PPS did/DOD find/VB that/CS many/AP of/IN georgia's/NP$ registration/NN and/CC election/NN laws/NNS ``/`` are/BER outmoded/JJ or/CC inadequate/JJ and/CC often/RB ambiguous/JJ ''/'' ./.

Untagged: the jury said it did find that many of georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .

HMM-tagged: the/AT jury/NN said/VBD it/PPS did/DOD find/VB that/CS many/AP of/IN georgia's/NP$ registration/NN and/CC election/NN laws/NNS ``/`` are/BER outmoded/VBG or/CC inadequate/JJ and/CC often/RB ambiguous/VB ''/'' ./.
Entropy: 20.8163623192

------------------------------------------------------------
Test: it/PPS recommended/VBD that/CS fulton/NP legislators/NNS act/VB ``/`` to/TO have/HV these/DTS laws/NNS studied/VBN and/CC revised/VBN to/IN the/AT end/NN of/IN modernizing/VBG and/CC improving/VBG them/PPO ''/'' ./.

Untagged: it recommended that fulton legislators act `` to have these laws studied and revised to the end of modernizing and improving them '' .

HMM-tagged: it/PPS recommended/VBD that/CS fulton/NP legislators/NNS act/VB ``/`` to/TO have/HV these/DTS laws/NNS studied/VBD and/CC revised/VBD to/IN the/AT end/NN of/IN modernizing/NP and/CC improving/VBG them/PPO ''/'' ./.

Entropy: 20.3244921203

------------------------------------------------------------
Test: the/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN other/AP topics/NNS ,/, among/IN them/PPO the/AT atlanta/NP and/CC fulton/NP county/NN purchasing/VBG departments/NNS which/WDT it/PPS said/VBD ``/`` are/BER well/QL operated/VBN and/CC follow/VB generally/RB accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT interest/NN of/IN both/ABX governments/NNS ''/'' ./.

Untagged: the grand jury commented on a number of other topics , among them the atlanta and fulton county purchasing departments which it said `` are well operated and follow generally accepted practices which inure to the best interest of both governments '' .

HMM-tagged: the/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN other/AP topics/NNS ,/, among/IN them/PPO the/AT atlanta/NP and/CC fulton/NP county/NN purchasing/NN departments/NNS which/WDT it/PPS said/VBD ``/`` are/BER well/RB operated/VBN and/CC follow/VB generally/RB accepted/VBN practices/NNS which/WDT inure/VBZ to/IN the/AT best/JJT interest/NN of/IN both/ABX governments/NNS ''/'' ./.

Entropy: 31.3834231469

------------------------------------------------------------
Test: merger/NN proposed/VBN

Untagged: merger proposed

HMM-tagged: merger/PPS proposed/VBD

Entropy: 5.6718203946

------------------------------------------------------------
Test: however/WRB ,/, the/AT jury/NN said/VBD it/PPS believes/VBZ ``/`` these/DTS two/CD offices/NNS should/MD be/BE combined/VBN to/TO achieve/VB greater/JJR efficiency/NN and/CC reduce/VB the/AT cost/NN of/IN administration/NN ''/'' ./.

Untagged: however , the jury said it believes `` these two offices should be combined to achieve greater efficiency and reduce the cost of administration '' .

HMM-tagged: however/WRB ,/, the/AT jury/NN said/VBD it/PPS believes/VBZ ``/`` these/DTS two/CD offices/NNS should/MD be/BE combined/VBN to/TO achieve/VB greater/JJR efficiency/NN and/CC reduce/VB the/AT cost/NN of/IN administration/NN ''/'' ./.

Entropy: 8.27545943909

------------------------------------------------------------
Test: the/AT city/NN purchasing/VBG department/NN ,/, the/AT jury/NN said/VBD ,/, ``/`` is/BEZ lacking/VBG in/IN experienced/VBN clerical/JJ personnel/NNS as/CS a/AT result/NN of/IN city/NN personnel/NNS policies/NNS ''/'' ./.

Untagged: the city purchasing department , the jury said , `` is lacking in experienced clerical personnel as a result of city personnel policies '' .

HMM-tagged: the/AT city/NN purchasing/NN department/NN ,/, the/AT jury/NN said/VBD ,/, ``/`` is/BEZ lacking/VBG in/IN experienced/AT clerical/JJ personnel/NNS as/CS a/AT result/NN of/IN city/NN personnel/NNS policies/NNS ''/'' ./.

Entropy: 16.7622537278

------------------------------------------------------------
accuracy over 284 tokens: 92.96

The outcome of an NER tagger may be regarded as the response, and the human interpretation as the answer key. We can then provide the following definitions:

  • Correct: The response is exactly the same as the answer key
  • Incorrect: The response is not the same as the answer key
  • Missing: The answer key is tagged, but the response is not
  • Spurious: The response is tagged, but the answer key is not

Performance of an NER-based system can be judged by using the following parameters:

  • Precision (P): It is defined as follows:

    P = Correct / (Correct + Incorrect + Spurious)

  • Recall (R): It is defined as follows:

    R = Correct / (Correct + Incorrect + Missing)

  • F-Measure: It is defined as follows:

    F-Measure = (2 * P * R) / (P + R)
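
As a small illustration, the three measures can be computed directly from the four counts defined above (the helper function ner_scores is hypothetical, written only for this example):

def ner_scores(correct, incorrect, missing, spurious):
    # Compute precision, recall, and F-measure from raw counts.
    precision = correct / (correct + incorrect + spurious)
    recall = correct / (correct + incorrect + missing)
    f_measure = (2 * precision * recall) / (precision + recall)
    return precision, recall, f_measure

# For example, with 80 correct, 10 incorrect, 5 missing, and 5 spurious tags:
print(ner_scores(80, 10, 5, 5))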

Training NER using Machine Learning Toolkits

NER can be performed using the following approaches:

  • Rule-based or Handcrafted approach:
    • List Lookup approach
    • Linguistic approach
  • Machine Learning-based approach or Automated approach:
    • Hidden Markov Model
    • Maximum Entropy Markov Model
    • Conditional Random Fields
    • Support Vector Machine
    • Decision Trees

It has been shown experimentally that machine learning-based approaches outperform rule-based approaches; moreover, combining the two approaches increases NER performance further. A minimal sketch of the machine learning route follows.
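
In this sketch, a token-level named entity classifier is trained with NLTK's NaiveBayesClassifier; the feature set, the toy training triples, and the helper ne_features are illustrative only, not a standard recipe:

import nltk

def ne_features(word, pos):
    # Simple word-level features for named entity classification.
    return {
        'pos': pos,
        'capitalized': word[0].isupper(),
        'all_caps': word.isupper(),
        'suffix2': word[-2:],
    }

# Toy training data: (word, POS tag, NE label) triples.
train = [
    ('John', 'NNP', 'PERSON'), ('Greece', 'NNP', 'LOCATION'),
    ('went', 'VBD', 'O'), ('to', 'TO', 'O'),
    ('Smith', 'NNP', 'PERSON'), ('Germany', 'NNP', 'LOCATION'),
    ('meets', 'VBZ', 'O'), ('in', 'IN', 'O'),
]
featuresets = [(ne_features(w, p), label) for w, p, label in train]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

# Classify an unseen token; with such tiny data the result is only a guess.
print(classifier.classify(ne_features('NY', 'NNP')))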

NER using POS tagging

NER can also be performed with the help of POS tagging. The POS tags that can be used are as follows (they are available at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html):

Tag    Description
CC     Coordinating conjunction
CD     Cardinal number
DT     Determiner
EX     Existential there
FW     Foreign word
IN     Preposition or subordinating conjunction
JJ     Adjective
JJR    Adjective, comparative
JJS    Adjective, superlative
LS     List item marker
MD     Modal
NN     Noun, singular or mass
NNS    Noun, plural
NNP    Proper noun, singular
NNPS   Proper noun, plural
PDT    Predeterminer
POS    Possessive ending
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Adverb, comparative
RBS    Adverb, superlative
RP     Particle
SYM    Symbol
TO     to
UH     Interjection
VB     Verb, base form
VBD    Verb, past tense
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
WDT    Wh-determiner
WP     Wh-pronoun
WP$    Possessive wh-pronoun
WRB    Wh-adverb

If POS tagging is performed, then the POS information can be used to identify named entities; tokens tagged with the NNP tag are treated as named entities.

Consider the following example in NLTK in which POS tagging is used to perform NER:

>>> import nltk
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John and Smith are going to NY and Germany"))
[('John', 'NNP'), ('and', 'CC'), ('Smith', 'NNP'), ('are', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('NY', 'NNP'), ('and', 'CC'), ('Germany', 'NNP')]

Here, the named entities are John, Smith, NY, and Germany, since they are tagged with the NNP tag.

Let's see another example in which POS tagging is performed in NLTK and the POS tag information is used to detect Named Entities:

>>> import nltk
>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:700])
>>> sentence = ['John', 'and', 'Smith', 'went', 'to', 'NY', 'and', 'Germany']
>>> for word, tag in tagger.tag(sentence):
	print(word, '->', tag)

John -> NP
and -> CC
Smith -> None
went -> VBD
to -> TO
NY -> None
and -> CC
Germany -> None

Here, John has been tagged with the NP tag (the Brown corpus tag for proper nouns), so it is identified as a named entity. Some of the tokens are tagged None because they did not occur in the training data; a remedy is shown in the sketch below.
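
One remedy is to chain a backoff tagger behind the unigram tagger, so that unseen tokens receive a fallback tag instead of None. The following sketch assumes 'NN' as the default tag, which is a common but arbitrary choice:

from nltk.corpus import brown
from nltk.tag import UnigramTagger, DefaultTagger

# Tokens unseen during training fall back to the default tag.
backoff = DefaultTagger('NN')
tagger = UnigramTagger(brown.tagged_sents(categories='news')[:700],
                       backoff=backoff)
print(tagger.tag(['John', 'and', 'Smith', 'went', 'to', 'NY']))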
