You can train your own named entity chunker using the ieer corpus, which stands for Information Extraction: Entity Recognition. It takes a bit of extra work, though, because the ieer corpus has chunk trees but no part-of-speech tags for words. Using the ieertree2conlltags() and ieer_chunked_sents() functions in chunkers.py, we can create named entity chunk trees from the ieer corpus to train the ClassifierChunker class created in the Classification-based chunking recipe:
import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer

def ieertree2conlltags(tree, tag=nltk.tag.pos_tag):
    words, ents = zip(*tree.pos())
    iobs = []
    prev = None
    for ent in ents:
        if ent == tree.label():
            iobs.append('O')
            prev = None
        elif prev == ent:
            iobs.append('I-%s' % ent)
        else:
            iobs.append('B-%s' % ent)
        prev = ent
    words, tags = zip(*tag(words))
    return zip(words, tags, iobs)

def ieer_chunked_sents(tag=nltk.tag.pos_tag):
    for doc in ieer.parsed_docs():
        tagged = ieertree2conlltags(doc.text, tag)
        yield conlltags2tree(tagged)
We'll use 80 of the 94 chunk trees for training, and the rest for testing. Then, we can see how the chunker does on the first sentence of the treebank_chunk corpus:
>>> from chunkers import ieer_chunked_sents, ClassifierChunker
>>> from nltk.corpus import treebank_chunk
>>> ieer_chunks = list(ieer_chunked_sents())
>>> len(ieer_chunks)
94
>>> chunker = ClassifierChunker(ieer_chunks[:80])
>>> chunker.parse(treebank_chunk.tagged_sents()[0])
Tree('S', [Tree('LOCATION', [('Pierre', 'NNP'), ('Vinken', 'NNP')]), (',', ','), Tree('DURATION', [('61', 'CD'), ('years', 'NNS')]), Tree('MEASURE', [('old', 'JJ')]), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), Tree('DATE', [('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])
So, it found a correct DURATION and DATE, but tagged Pierre Vinken as a LOCATION. Let's see how it scores against the rest of the ieer chunk trees:
>>> score = chunker.evaluate(ieer_chunks[80:])
>>> score.accuracy()
0.8829018388070625
>>> score.precision()
0.4088717454194793
>>> score.recall()
0.5053635280095352
Accuracy is pretty good, but precision and recall are very low. Low precision means many false positives: chunks the chunker found that are not in the test data. Low recall means many false negatives: chunks in the test data that the chunker missed.
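To see why both scores matter, here is a minimal sketch of how precision and recall relate to false positives and false negatives. The functions and counts below are hypothetical, for illustration only; they are not part of the nltk scoring API:

```python
def precision(tp, fp):
    # Fraction of predicted chunks that were correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of gold-standard chunks that were found.
    return tp / (tp + fn)

# Hypothetical counts: 8 correct chunks, 12 spurious ones, 8 missed ones.
print(precision(8, 12))  # 0.4
print(recall(8, 8))      # 0.5
```

Token-level accuracy can stay high even with scores like these, because most words carry the easy O tag, which the chunker rarely gets wrong.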
The truth is, we're not working with ideal training data. The ieer trees generated by ieer_chunked_sents() are not entirely accurate. First, there are no explicit sentence breaks, so each document is a single tree. Second, the words are not explicitly tagged, so we have to guess using nltk.tag.pos_tag().
The ieer corpus provides a parsed_docs() method that returns a list of documents, each with a text attribute. This text attribute is a document Tree that we convert to a list of 3-tuples of the form (word, pos, iob). To get these final 3-tuples, we must first flatten the Tree using tree.pos(), which returns a list of 2-tuples of the form (word, entity), where entity is either the entity tag or the top tag of the tree. Any word whose entity is the top tag is outside the named entity chunks and gets the IOB tag O. All other words are either at the beginning of (B-) or inside (I-) a named entity chunk, depending on whether the previous word had the same entity tag. Once we have all the IOB tags, we can get the part-of-speech tags of all the words and join the words, part-of-speech tags, and IOB tags into 3-tuples using zip().
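The B/I/O assignment described above can be sketched in plain Python, without any corpus. The to_iob() function and the (word, entity) pairs below are invented stand-ins that mirror the loop inside ieertree2conlltags(), with 'S' playing the role of the top tag of the tree:

```python
def to_iob(pairs, top='S'):
    # Convert (word, entity) pairs into (word, iob) pairs.
    iobs = []
    prev = None
    for word, ent in pairs:
        if ent == top:
            iobs.append((word, 'O'))           # outside any named entity
            prev = None
        elif prev == ent:
            iobs.append((word, 'I-%s' % ent))  # continuing the current chunk
        else:
            iobs.append((word, 'B-%s' % ent))  # beginning a new chunk
        prev = ent
    return iobs

pairs = [('New', 'LOCATION'), ('York', 'LOCATION'), ('is', 'S'), ('big', 'S')]
print(to_iob(pairs))
# [('New', 'B-LOCATION'), ('York', 'I-LOCATION'), ('is', 'O'), ('big', 'O')]
```

Note that two adjacent words only share a chunk when their entity tags match consecutively; resetting prev to None on an O tag ensures a new entity after a gap starts with B-.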
Despite the non-ideal training data, the ieer corpus provides a good place to start for training a named entity chunker. The data comes from New York Times and AP Newswire reports. Each doc from ieer.parsed_docs() also contains a headline attribute that is a Tree:
>>> from nltk.corpus import ieer
>>> ieer.parsed_docs()[0].headline
Tree('DOCUMENT', ['Kenyans', 'protest', 'tax', 'hikes'])