Named entity recognition is a specific kind of chunk extraction that uses entity tags instead of, or in addition to, chunk tags. Common entity tags include PERSON, ORGANIZATION, and LOCATION. Part-of-speech tagged sentences are parsed into chunk trees as with normal chunking, but the labels of the trees can be entity tags instead of chunk phrase tags.
NLTK comes with a pre-trained named entity chunker. This chunker has been trained on data from the ACE program, a National Institute of Standards and Technology (NIST) sponsored program for Automatic Content Extraction, which you can read more about at http://www.itl.nist.gov/iad/894.01/tests/ace/. Unfortunately, this data is not included in the NLTK corpora, but the trained chunker is. This chunker can be used through the ne_chunk() function in the nltk.chunk module. The ne_chunk() function will chunk a single part-of-speech tagged sentence into a Tree. The following is an example using ne_chunk() on the first tagged sentence of the treebank_chunk corpus:
>>> from nltk.chunk import ne_chunk
>>> from nltk.corpus import treebank_chunk
>>> ne_chunk(treebank_chunk.tagged_sents()[0])
Tree('S', [Tree('PERSON', [('Pierre', 'NNP')]), Tree('ORGANIZATION', [('Vinken', 'NNP')]), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')])
You can see that two entity tags are found: PERSON and ORGANIZATION. Each of these subtrees contains a list of the words that are recognized as a PERSON or ORGANIZATION. To extract these named entities, we can write a simple helper function that will get the leaves of all the subtrees we are interested in:
def sub_leaves(tree, label):
    # Collect the leaves of every subtree whose label matches the given entity label.
    return [t.leaves() for t in tree.subtrees(lambda s: s.label() == label)]
Then, we can call this function to get all the PERSON or ORGANIZATION leaves from a tree:
>>> tree = ne_chunk(treebank_chunk.tagged_sents()[0])
>>> from chunkers import sub_leaves
>>> sub_leaves(tree, 'PERSON')
[[('Pierre', 'NNP')]]
>>> sub_leaves(tree, 'ORGANIZATION')
[[('Vinken', 'NNP')]]
You will notice that the chunker has mistakenly separated Vinken into its own ORGANIZATION Tree instead of including it with the PERSON Tree containing Pierre. Such is the case with statistical natural language processing: you can't always expect perfection.
The pre-trained named entity chunker is much like any other chunker, and in fact uses a MaxentClassifier-powered ClassifierBasedTagger to determine IOB tags. But instead of B-NP and I-NP IOB tags, it uses B-PERSON, I-PERSON, B-ORGANIZATION, I-ORGANIZATION, and more. It also uses the O tag to mark words that are not part of a named entity (and thus are outside the named entity subtrees).
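To make the IOB scheme concrete, here is a minimal pure-Python sketch that groups IOB-tagged words into entity chunks. It does not use NLTK or reproduce the chunker's internals; the iob_to_entities() helper and the hand-labeled tags in the example are hypothetical and chosen only for illustration.

```python
def iob_to_entities(tagged):
    """Group (word, iob_tag) pairs into (label, [words]) entity chunks.

    Hypothetical helper for illustration; not part of NLTK.
    """
    entities = []
    current_label, current_words = None, []
    for word, iob in tagged:
        if iob == 'O':
            prefix, label = 'O', None
        else:
            # 'B-PERSON' -> prefix 'B', label 'PERSON'
            prefix, _, label = iob.partition('-')
        # An I- tag with the same label continues the open chunk.
        if prefix == 'I' and label == current_label:
            current_words.append(word)
            continue
        # Anything else closes the open chunk, if there is one.
        if current_words:
            entities.append((current_label, current_words))
        if prefix in ('B', 'I'):
            # Start a new chunk (a stray I- tag is treated like B-).
            current_label, current_words = label, [word]
        else:
            # O tag: the word is outside any named entity.
            current_label, current_words = None, []
    if current_words:
        entities.append((current_label, current_words))
    return entities


# Hand-labeled tags (an assumption, not actual chunker output).
tagged = [
    ('Pierre', 'B-PERSON'), ('Vinken', 'I-PERSON'), (',', 'O'),
    ('61', 'O'), ('years', 'O'), ('old', 'O'), (',', 'O'),
    ('will', 'O'), ('join', 'O'), ('Elsevier', 'B-ORGANIZATION'),
]
print(iob_to_entities(tagged))
```

Words tagged O never appear in the result, which mirrors how they sit outside the named entity subtrees in a chunk tree.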
To process multiple sentences at a time, you can use ne_chunk_sents(). Here's an example where we process the first 10 sentences from treebank_chunk.tagged_sents() and get the ORGANIZATION sub_leaves():
>>> from nltk.chunk import ne_chunk_sents
>>> trees = ne_chunk_sents(treebank_chunk.tagged_sents()[:10])
>>> [sub_leaves(t, 'ORGANIZATION') for t in trees]
[[[('Vinken', 'NNP')]], [[('Elsevier', 'NNP')]], [[('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP')]], [], [], [[('Inc.', 'NNP')], [('Micronite', 'NN')]], [[('New', 'NNP'), ('England', 'NNP'), ('Journal', 'NNP')]], [[('Lorillard', 'NNP')]], [], []]
You can see that there are a couple of multiword ORGANIZATION chunks, such as New England Journal. There were also a few sentences that had no ORGANIZATION chunks, as indicated by the empty lists [].
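The nested per-sentence lists are awkward to read when all you want is the entity names. As a rough sketch, the following pure-Python snippet flattens that structure into plain strings. The entity_names() helper is hypothetical (it is not part of NLTK), and the results list is copied from the output shown above rather than recomputed.

```python
def entity_names(leaves_per_sentence):
    """Flatten per-sentence sub_leaves() results into entity name strings.

    Hypothetical helper for illustration; not part of NLTK.
    """
    names = []
    for sentence_chunks in leaves_per_sentence:
        for leaves in sentence_chunks:
            # Each chunk is a list of (word, pos_tag) leaves; keep only the words.
            names.append(' '.join(word for word, tag in leaves))
    return names


# Copied from the ORGANIZATION output shown above, one inner list per sentence.
results = [
    [[('Vinken', 'NNP')]],
    [[('Elsevier', 'NNP')]],
    [[('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP')]],
    [],
    [],
    [[('Inc.', 'NNP')], [('Micronite', 'NN')]],
    [[('New', 'NNP'), ('England', 'NNP'), ('Journal', 'NNP')]],
    [[('Lorillard', 'NNP')]],
    [],
    [],
]
names = entity_names(results)
print(names)
```

Sentences with no ORGANIZATION chunks contribute nothing, so the empty lists simply disappear from the flattened result.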
If you don't care about the particular kind of named entity to extract, you can pass binary=True into ne_chunk() or ne_chunk_sents(). Now, all named entities will be tagged with NE:
>>> ne_chunk(treebank_chunk.tagged_sents()[0], binary=True)
Tree('S', [Tree('NE', [('Pierre', 'NNP'), ('Vinken', 'NNP')]), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')])
So, binary in this case means that an arbitrary chunk either is or is not a named entity. If we get the sub_leaves(), we can see that Pierre Vinken is correctly combined into a single named entity:
>>> sub_leaves(ne_chunk(treebank_chunk.tagged_sents()[0], binary=True), 'NE')
[[('Pierre', 'NNP'), ('Vinken', 'NNP')]]