A simple way to do named entity extraction is to chunk all proper nouns (tagged with NNP). We can tag these chunks as NAME, since the definition of a proper noun is the name of a person, place, or thing. Using the RegexpParser class, we can create a very simple grammar that combines all proper nouns into a NAME chunk. Then, we can test this on the first tagged sentence of treebank_chunk to compare the results with the previous recipe:
>>> chunker = RegexpParser(r'''
... NAME:
...   {<NNP>+}
... ''')
>>> sub_leaves(chunker.parse(treebank_chunk.tagged_sents()[0]), 'NAME')
[[('Pierre', 'NNP'), ('Vinken', 'NNP')], [('Nov.', 'NNP')]]
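The sub_leaves() helper used above comes from an earlier recipe and is not defined in this section. As a minimal sketch, it can be assumed to collect the leaves of every subtree carrying a given label; the hand-tagged sentence below stands in for the treebank_chunk corpus so the example runs without any corpus downloads:

```python
from nltk.chunk import RegexpParser

def sub_leaves(tree, label):
    # Return the (word, tag) leaves of every subtree whose label matches.
    # This is an assumed reconstruction of the helper from the earlier recipe.
    return [t.leaves() for t in tree.subtrees(lambda t: t.label() == label)]

# Combine every run of adjacent NNP-tagged words into a NAME chunk.
chunker = RegexpParser(r'''
NAME:
  {<NNP>+}
''')

# Hand-tagged stand-in sentence (hypothetical, not from treebank_chunk).
tagged = [('Pierre', 'NNP'), ('Vinken', 'NNP'), ('will', 'MD'),
          ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('Nov.', 'NNP')]

print(sub_leaves(chunker.parse(tagged), 'NAME'))
```

Note how the two adjacent NNP words form a single NAME chunk, while the lone Nov. becomes a chunk of its own.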
Although we get Nov. as a NAME chunk, this isn't a wrong result, as Nov. is the name of a month.

The NAME chunker is a simple usage of the RegexpParser class, covered in the Chunking and chinking with regular expressions, Merging and splitting chunks with regular expressions, and Partial parsing with regular expressions recipes. All sequences of NNP tagged words are combined into NAME chunks.
If we want to chunk only the names of people, we can build a PersonChunker class that uses the names corpus for chunking. This class can be found in chunkers.py:
from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import names

class PersonChunker(ChunkParserI):
    def __init__(self):
        self.name_set = set(names.words())

    def parse(self, tagged_sent):
        iobs = []
        in_person = False

        for word, tag in tagged_sent:
            if word in self.name_set and in_person:
                iobs.append((word, tag, 'I-PERSON'))
            elif word in self.name_set:
                iobs.append((word, tag, 'B-PERSON'))
                in_person = True
            else:
                iobs.append((word, tag, 'O'))
                in_person = False

        return conlltags2tree(iobs)
The PersonChunker class iterates over the tagged sentence, checking whether each word is in its name_set (constructed from the names corpus). If the current word is in the name_set, then it uses either the B-PERSON or I-PERSON IOB tag, depending on whether the previous word was also in the name_set. Any word that's not in the name_set gets the O IOB tag. When complete, the list of IOB tags is converted to a Tree using conlltags2tree(). Using it on the same tagged sentence as before, we get the following result:
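The IOB-to-tree conversion can be seen in isolation. This sketch feeds conlltags2tree() a hand-built list of (word, tag, IOB) triples of the kind parse() produces; tree2conlltags() performs the inverse conversion, recovering the original triples:

```python
from nltk.chunk.util import conlltags2tree, tree2conlltags

# Hand-built (word, tag, IOB) triples, mimicking PersonChunker's parse() output.
iobs = [('Pierre', 'NNP', 'B-PERSON'), ('Vinken', 'NNP', 'I-PERSON'),
        ('joined', 'VBD', 'O')]

# Convert the flat triples into a Tree with a PERSON subtree.
tree = conlltags2tree(iobs)
print(tree)

# tree2conlltags() inverts the conversion.
print(tree2conlltags(tree) == iobs)
```

Because the two functions are inverses, they make it easy to move between the flat IOB representation used for training and scoring and the Tree representation used by the chunker interface.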
>>> from chunkers import PersonChunker
>>> chunker = PersonChunker()
>>> sub_leaves(chunker.parse(treebank_chunk.tagged_sents()[0]), 'PERSON')
[[('Pierre', 'NNP')]]
We no longer get Nov., but we've also lost Vinken, as it is not found in the names corpus. This recipe highlights some of the difficulties of chunk extraction and natural language processing in general: