Extracting named entities

Named entity recognition is a specific kind of chunk extraction that uses entity tags instead of, or in addition to, chunk tags. Common entity tags include PERSON, ORGANIZATION, and LOCATION. Part-of-speech tagged sentences are parsed into chunk trees as with normal chunking, but the labels of the trees can be entity tags instead of chunk phrase tags.

How to do it...

NLTK comes with a pre-trained named entity chunker. This chunker has been trained on data from the ACE program, a National Institute of Standards and Technology (NIST) sponsored program for Automatic Content Extraction, which you can read more about at http://www.itl.nist.gov/iad/894.01/tests/ace/. Unfortunately, this data is not included in the NLTK corpora, but the trained chunker is. This chunker can be used through the ne_chunk() function in the nltk.chunk module. The ne_chunk() function will chunk a single part-of-speech tagged sentence into a Tree. The following is an example using ne_chunk() on the first tagged sentence of the treebank_chunk corpus:

>>> from nltk.chunk import ne_chunk
>>> from nltk.corpus import treebank_chunk
>>> ne_chunk(treebank_chunk.tagged_sents()[0])
Tree('S', [Tree('PERSON', [('Pierre', 'NNP')]), Tree('ORGANIZATION', [('Vinken', 'NNP')]), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')])

You can see that two entity tags are found: PERSON and ORGANIZATION. Each of these subtrees contains a list of the words that are recognized as a PERSON or ORGANIZATION. To extract these named entities, we can write a simple helper function that will get the leaves of all the subtrees we are interested in:

def sub_leaves(tree, label):
    # Collect the leaves of every subtree whose label matches the given label
    return [t.leaves() for t in tree.subtrees(lambda s: s.label() == label)]

Then, we can call this function to get all the PERSON or ORGANIZATION leaves from a tree:

>>> tree = ne_chunk(treebank_chunk.tagged_sents()[0])
>>> from chunkers import sub_leaves
>>> sub_leaves(tree, 'PERSON')
[[('Pierre', 'NNP')]]
>>> sub_leaves(tree, 'ORGANIZATION')
[[('Vinken', 'NNP')]]

You will notice that the chunker has mistakenly separated Vinken into its own ORGANIZATION Tree instead of including it with the PERSON Tree containing Pierre. Such is the case with statistical natural language processing: you can't always expect perfection.
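Since sub_leaves() returns lists of (word, tag) tuples, it is often handy to turn each list into a plain string. The following is a minimal sketch of one way to do that; the entity_strings() name is hypothetical and is not part of NLTK or of the chunkers module used in this recipe. The sample input is copied from the sub_leaves() output shown previously, so no NLTK data is needed to run it:

```python
def entity_strings(leaf_lists):
    """Join the words of each leaf list into one string, dropping the POS tags."""
    return [' '.join(word for word, _tag in leaves) for leaves in leaf_lists]


# Sample input copied from the sub_leaves(tree, 'PERSON') output above.
person_leaves = [[('Pierre', 'NNP')]]
print(entity_strings(person_leaves))  # ['Pierre']
```

Because the helper only looks at the first element of each tuple, it works the same way on the output for any entity label.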

How it works...

The pre-trained named entity chunker is much like any other chunker, and in fact uses a MaxentClassifier-powered ClassifierBasedTagger to determine IOB tags. But instead of B-NP and I-NP IOB tags, it uses B-PERSON, I-PERSON, B-ORGANIZATION, I-ORGANIZATION, and similar tags for the other entity types. It also uses the O tag to mark words that are not part of a named entity (and thus are outside the named entity subtrees).
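To make the relationship between IOB tags and entity subtrees concrete, here is a small pure-Python sketch that groups IOB-tagged words into labeled entities. It is only an illustration of the tagging scheme, not the chunker's actual implementation. The iob_to_entities() name is hypothetical, and the IOB tags in the sample are written by hand to show an ideal labeling; they are not output from the pre-trained chunker:

```python
def iob_to_entities(triples):
    """Group (word, pos, iob) triples into (label, [words]) entities."""
    entities = []
    current = None  # the (label, words) entity currently being built, if any
    for word, _pos, iob in triples:
        prefix, _, label = iob.partition('-')
        if prefix == 'I' and current is not None and current[0] == label:
            # I- tag continuing the open entity of the same type
            current[1].append(word)
        elif prefix in ('B', 'I'):
            # B- tag (or a stray I- tag) starts a new entity
            current = (label, [word])
            entities.append(current)
        else:
            # O tag: the word is outside any named entity
            current = None
    return entities


# Hand-written tags for illustration only (assumed, not classifier output).
tagged = [
    ('Pierre', 'NNP', 'B-PERSON'),
    ('Vinken', 'NNP', 'I-PERSON'),
    (',', ',', 'O'),
    ('61', 'CD', 'O'),
]
print(iob_to_entities(tagged))  # [('PERSON', ['Pierre', 'Vinken'])]
```

Each B- tag opens a new entity, each matching I- tag extends it, and an O tag closes whatever entity was open.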

There's more...

To process multiple sentences at a time, you can use ne_chunk_sents(). Here's an example where we process the first 10 sentences from treebank_chunk.tagged_sents() and get ORGANIZATION sub_leaves():

>>> from nltk.chunk import ne_chunk_sents
>>> trees = ne_chunk_sents(treebank_chunk.tagged_sents()[:10])
>>> [sub_leaves(t, 'ORGANIZATION') for t in trees]
[[[('Vinken', 'NNP')]], [[('Elsevier', 'NNP')]], [[('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP')]], [], [], [[('Inc.', 'NNP')], [('Micronite', 'NN')]], [[('New', 'NNP'), ('England', 'NNP'), ('Journal', 'NNP')]], [[('Lorillard', 'NNP')]], [], []]

You can see that there are a couple of multiword ORGANIZATION chunks, such as New England Journal. There were also a few sentences that had no ORGANIZATION chunks, as indicated by the empty lists [].
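If you want a summary across sentences rather than one list per sentence, the nested lists can be flattened and counted. The following is a rough sketch using collections.Counter from the standard library; the count_entities() name is hypothetical, and the sample data is a subset copied from the output shown previously:

```python
from collections import Counter


def count_entities(per_sentence_leaves):
    """Count entity strings across a list of per-sentence sub_leaves() results."""
    counts = Counter()
    for leaf_lists in per_sentence_leaves:
        for leaves in leaf_lists:
            # Join the words of one entity, dropping the POS tags
            counts[' '.join(word for word, _tag in leaves)] += 1
    return counts


# A few per-sentence results copied from the output above.
results = [
    [[('Vinken', 'NNP')]],
    [],
    [[('Inc.', 'NNP')], [('Micronite', 'NN')]],
    [[('New', 'NNP'), ('England', 'NNP'), ('Journal', 'NNP')]],
]
counts = count_entities(results)
print(counts['New England Journal'])  # 1
```

Sentences with no matching chunks contribute nothing, so the empty lists are skipped naturally.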

Binary named entity extraction

If you don't care about the particular kind of named entity to extract, you can pass binary=True into ne_chunk() or ne_chunk_sents(). Now, all named entities will be tagged with NE:

>>> ne_chunk(treebank_chunk.tagged_sents()[0], binary=True)
Tree('S', [Tree('NE', [('Pierre', 'NNP'), ('Vinken', 'NNP')]), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')])

So, binary in this case means that an arbitrary chunk either is or is not a named entity. If we get the sub_leaves(), we can see that Pierre Vinken is correctly combined into a single named entity:

>>> sub_leaves(ne_chunk(treebank_chunk.tagged_sents()[0], binary=True), 'NE')
[[('Pierre', 'NNP'), ('Vinken', 'NNP')]]

See also

In the next recipe, we'll create our own simple named entity chunker.
