A simple way to do named entity extraction is to chunk all proper nouns (tagged with NNP). We can tag these chunks as NAME, since the definition of a proper noun is the name of a person, place, or thing. Using the RegexpParser class, we can create a very simple grammar that combines all proper nouns into a NAME chunk. Then, we can test this on the first tagged sentence of treebank_chunk to compare the results with the previous recipe:
>>> chunker = RegexpParser(r'''
... NAME:
...   {<NNP>+}
... ''')
>>> sub_leaves(chunker.parse(treebank_chunk.tagged_sents()[0]), 'NAME')
[[('Pierre', 'NNP'), ('Vinken', 'NNP')], [('Nov.', 'NNP')]]
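The sub_leaves() helper used above comes from an earlier recipe and is not defined in this section. As a minimal sketch, it can be assumed to collect the leaves of every subtree carrying a given label; the hand-tagged sentence below stands in for the treebank_chunk corpus so the example runs without any corpus downloads:

```python
from nltk.chunk import RegexpParser

def sub_leaves(tree, label):
    # Return the (word, tag) leaves of every subtree whose label matches.
    # This is an assumed reconstruction of the helper from the earlier recipe.
    return [t.leaves() for t in tree.subtrees(lambda t: t.label() == label)]

# Combine every run of adjacent NNP-tagged words into a NAME chunk.
chunker = RegexpParser(r'''
NAME:
  {<NNP>+}
''')

# Hand-tagged stand-in sentence (hypothetical, not from treebank_chunk).
tagged = [('Pierre', 'NNP'), ('Vinken', 'NNP'), ('will', 'MD'),
          ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('Nov.', 'NNP')]

print(sub_leaves(chunker.parse(tagged), 'NAME'))
```

Note how the two adjacent NNP words form a single NAME chunk, while the lone Nov. becomes a chunk of its own.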
Although we get Nov. as a NAME chunk, this isn't a wrong result, as Nov. is the name of a month.

The NAME chunker is a simple usage of the RegexpParser class, covered in the Chunking and chinking with regular expressions, Merging and splitting chunks with regular expressions, and Partial parsing with regular expressions recipes. All sequences of NNP tagged words are combined into NAME chunks.
If we want to chunk only the names of people, we can build a PersonChunker class that uses the names corpus for chunking. This class can be found in chunkers.py:
from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import names

class PersonChunker(ChunkParserI):
    def __init__(self):
        self.name_set = set(names.words())

    def parse(self, tagged_sent):
        iobs = []
        in_person = False

        for word, tag in tagged_sent:
            if word in self.name_set and in_person:
                iobs.append((word, tag, 'I-PERSON'))
            elif word in self.name_set:
                iobs.append((word, tag, 'B-PERSON'))
                in_person = True
            else:
                iobs.append((word, tag, 'O'))
                in_person = False

        return conlltags2tree(iobs)
The PersonChunker class iterates over the tagged sentence, checking whether each word is in its name_set (constructed from the names corpus). If the current word is in the name_set, then it uses either the B-PERSON or I-PERSON IOB tag, depending on whether the previous word was also in the name_set. Any word that's not in the name_set gets the O IOB tag. When complete, the list of IOB tags is converted to a Tree using conlltags2tree(). Using it on the same tagged sentence as before, we get the following result:
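The IOB-to-tree conversion can be seen in isolation. This sketch feeds conlltags2tree() a hand-built list of (word, tag, IOB) triples of the kind parse() produces; tree2conlltags() performs the inverse conversion, recovering the original triples:

```python
from nltk.chunk.util import conlltags2tree, tree2conlltags

# Hand-built (word, tag, IOB) triples, mimicking PersonChunker's parse() output.
iobs = [('Pierre', 'NNP', 'B-PERSON'), ('Vinken', 'NNP', 'I-PERSON'),
        ('joined', 'VBD', 'O')]

# Convert the flat triples into a Tree with a PERSON subtree.
tree = conlltags2tree(iobs)
print(tree)

# tree2conlltags() inverts the conversion.
print(tree2conlltags(tree) == iobs)
```

Because the two functions are inverses, they make it easy to move between the flat IOB representation used for training and scoring and the Tree representation used by the chunker interface.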
>>> from chunkers import PersonChunker
>>> chunker = PersonChunker()
>>> sub_leaves(chunker.parse(treebank_chunk.tagged_sents()[0]), 'PERSON')
[[('Pierre', 'NNP')]]
We no longer get Nov., but we've also lost Vinken, as it is not found in the names corpus. This recipe highlights some of the difficulties of chunk extraction and natural language processing in general: