Training a named entity chunker

You can train your own named entity chunker using the ieer corpus, which stands for Information Extraction: Entity Recognition. It takes a bit of extra work, though, because the ieer corpus has chunk trees but no part-of-speech tags for words.

How to do it...

Using the ieertree2conlltags() and ieer_chunked_sents() functions in chunkers.py, we can create named entity chunk trees from the ieer corpus to train the ClassifierChunker class created in the Classification-based chunking recipe:

import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer

def ieertree2conlltags(tree, tag=nltk.tag.pos_tag):
    # Flatten the tree into (word, entity) 2-tuples, where entity is
    # either an entity label or the top label of the tree.
    words, ents = zip(*tree.pos())
    iobs = []
    prev = None

    for ent in ents:
        if ent == tree.label():
            # The word hangs directly off the top node, so it's
            # outside any named entity chunk.
            iobs.append('O')
            prev = None
        elif prev == ent:
            # Same entity as the previous word: inside a chunk.
            iobs.append('I-%s' % ent)
        else:
            # A new entity chunk begins here.
            iobs.append('B-%s' % ent)
            prev = ent

    # Guess part-of-speech tags for the words, then combine words,
    # tags, and IOB tags into (word, pos, iob) 3-tuples.
    words, tags = zip(*tag(words))
    return zip(words, tags, iobs)

def ieer_chunked_sents(tag=nltk.tag.pos_tag):
    # The ieer corpus has no sentence breaks, so each document
    # becomes a single chunk tree.
    for doc in ieer.parsed_docs():
        tagged = ieertree2conlltags(doc.text, tag)
        yield conlltags2tree(tagged)

We'll use 80 of the 94 chunk trees for training, and the rest for testing. Then, we can see how the chunker does on the first sentence of the treebank_chunk corpus:

>>> from chunkers import ieer_chunked_sents, ClassifierChunker
>>> from nltk.corpus import treebank_chunk
>>> ieer_chunks = list(ieer_chunked_sents())
>>> len(ieer_chunks)
94
>>> chunker = ClassifierChunker(ieer_chunks[:80])
>>> chunker.parse(treebank_chunk.tagged_sents()[0])
Tree('S', [Tree('LOCATION', [('Pierre', 'NNP'), ('Vinken', 'NNP')]), (',', ','), Tree('DURATION', [('61', 'CD'), ('years', 'NNS')]), Tree('MEASURE', [('old', 'JJ')]), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), Tree('DATE', [('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')])

So, it found a correct DURATION and DATE, but mislabeled Pierre Vinken as a LOCATION. Let's see how it scores against the rest of the ieer chunk trees:

>>> score = chunker.evaluate(ieer_chunks[80:])
>>> score.accuracy()
0.8829018388070625
>>> score.precision()
0.4088717454194793
>>> score.recall()
0.5053635280095352

Accuracy is pretty good, but precision and recall are very low. That means lots of false negatives and false positives.
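
To inspect those errors directly, the ChunkScore object returned by evaluate() keeps them around: its missed() method returns the chunks in the test data that the chunker failed to find (the false negatives), and incorrect() returns the chunks the chunker proposed that are not in the test data (the false positives). A minimal sketch, continuing from the session above:

missed = score.missed()        # false negatives: gold chunks not found
incorrect = score.incorrect()  # false positives: guessed chunks not in the gold data
print(len(missed), len(incorrect))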

How it works...

The truth is, we're not working with ideal training data. The ieer trees generated by ieer_chunked_sents() are not entirely accurate. First, there are no explicit sentence breaks, so each document is a single tree. Second, the words are not explicitly tagged, so we have to guess using nltk.tag.pos_tag().
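
Both functions accept a tag keyword argument, so you can substitute a better tagger, such as one trained in an earlier recipe. A minimal sketch, assuming you have a pickled tagger on disk (the filename tagger.pickle is hypothetical):

import pickle

from chunkers import ieer_chunked_sents

# Load a previously trained tagger; 'tagger.pickle' is a hypothetical path.
with open('tagger.pickle', 'rb') as f:
    tagger = pickle.load(f)

# tagger.tag takes a list of words and returns (word, tag) 2-tuples,
# the same signature as nltk.tag.pos_tag.
ieer_chunks = list(ieer_chunked_sents(tag=tagger.tag))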

The ieer corpus provides a parsed_docs() function that returns a list of documents, each with a text attribute holding the document Tree. The ieertree2conlltags() function converts this Tree to a list of 3-tuples of the form (word, pos, iob). First, it flattens the Tree using tree.pos(), which returns a list of 2-tuples of the form (word, entity), where entity is either an entity label or the top label of the tree. Any word whose entity is the top label is outside the named entity chunks and gets the IOB tag O. A word with an actual entity label is either at the beginning of a named entity chunk (B-) if the previous word had a different entity, or inside one (I-) if the previous word had the same entity. Once we have all the IOB tags, we can get the part-of-speech tags of all the words and join the words, part-of-speech tags, and IOB tags into 3-tuples using zip().
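
To make the conversion concrete, here is a minimal sketch using a hand-built tree that mimics the ieer format (the words and entity labels are invented for illustration; the part-of-speech tags in the output will depend on the tagger):

from nltk.tree import Tree

from chunkers import ieertree2conlltags

# A miniature document tree in the ieer style: entities are subtrees,
# and all other words hang directly off the top DOCUMENT node.
doc = Tree('DOCUMENT', [
    Tree('PERSON', ['Pierre', 'Vinken']),
    'joined', 'the', 'board', 'in',
    Tree('LOCATION', ['New', 'York']),
])

# tree.pos() pairs each word with the label of its immediate parent.
print(doc.pos())
# [('Pierre', 'PERSON'), ('Vinken', 'PERSON'), ('joined', 'DOCUMENT'),
#  ('the', 'DOCUMENT'), ('board', 'DOCUMENT'), ('in', 'DOCUMENT'),
#  ('New', 'LOCATION'), ('York', 'LOCATION')]

# The IOB tags come out as B-PERSON, I-PERSON, O, O, O, O,
# B-LOCATION, I-LOCATION; the part-of-speech tags vary with the tagger.
for word, pos, iob in ieertree2conlltags(doc):
    print(word, pos, iob)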

There's more...

Despite the non-ideal training data, the ieer corpus provides a good place to start for training a named entity chunker. The data comes from New York Times and AP Newswire reports. Each doc from ieer.parsed_docs() also contains a headline attribute that is a Tree:

>>> from nltk.corpus import ieer
>>> ieer.parsed_docs()[0].headline
Tree('DOCUMENT', ['Kenyans', 'protest', 'tax', 'hikes'])

See also

The Extracting named entities recipe covers the pre-trained named entity chunker included with NLTK.
