Extracting proper noun chunks

A simple way to do named entity extraction is to chunk all proper nouns (tagged with NNP). We can tag these chunks as NAME, since the definition of a proper noun is the name of a person, place, or thing.

How to do it...

Using the RegexpParser class, we can create a very simple grammar that combines all proper nouns into a NAME chunk. Then, we can test this on the first tagged sentence of treebank_chunk to compare the results with the previous recipe:

>>> chunker = RegexpParser(r'''
... NAME:
...   {<NNP>+}
... ''')
>>> sub_leaves(chunker.parse(treebank_chunk.tagged_sents()[0]), 'NAME')
[[('Pierre', 'NNP'), ('Vinken', 'NNP')], [('Nov.', 'NNP')]]

Although we get Nov. as a NAME chunk, this isn't a wrong result, as Nov. is the name of a month.
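The sub_leaves() helper used above comes from the previous recipe. As a reminder, a minimal sketch consistent with the output shown here (the exact definition lives in the previous recipe) is:

```python
from nltk.tree import Tree

def sub_leaves(tree, label):
    # Collect the leaves of every subtree whose label matches
    return [t.leaves() for t in tree.subtrees(lambda s: s.label() == label)]

# Example on a hand-built tree, mirroring the chunker's output shape:
t = Tree('S', [Tree('NAME', [('Pierre', 'NNP'), ('Vinken', 'NNP')]),
               ('said', 'VBD')])
```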

How it works...

The NAME chunker is a simple usage of the RegexpParser class, covered in the Chunking and chinking with regular expressions, Merging and splitting chunks with regular expressions, and Partial parsing with regular expressions recipes. All sequences of NNP tagged words are combined into NAME chunks.
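Because the grammar only looks at tags, the same chunker works on any tagged sentence, not just the treebank. Here it is applied to a hand-tagged sentence (the words below are made up for illustration):

```python
from nltk.chunk import RegexpParser

chunker = RegexpParser(r'''
NAME:
  {<NNP>+}
''')

# Two NNP runs: "Mr. Smith" and "London" each become a NAME chunk
sent = [('Mr.', 'NNP'), ('Smith', 'NNP'), ('visited', 'VBD'), ('London', 'NNP')]
tree = chunker.parse(sent)
```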

There's more...

If we wanted to be sure to only chunk the names of people, then we can build a PersonChunker class that uses the names corpus for chunking. This class can be found in chunkers.py:

from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import names

class PersonChunker(ChunkParserI):
  def __init__(self):
    # All first names from the names corpus
    self.name_set = set(names.words())

  def parse(self, tagged_sent):
    iobs = []
    in_person = False

    for word, tag in tagged_sent:
      if word in self.name_set and in_person:
        # Previous word was also a name: continue the PERSON chunk
        iobs.append((word, tag, 'I-PERSON'))
      elif word in self.name_set:
        # Start a new PERSON chunk
        iobs.append((word, tag, 'B-PERSON'))
        in_person = True
      else:
        # Not a known name: outside any chunk
        iobs.append((word, tag, 'O'))
        in_person = False

    return conlltags2tree(iobs)

The PersonChunker class iterates over the tagged sentence, checking whether each word is in its name_set (constructed from the names corpus). If the current word is in name_set, then it uses either the B-PERSON or I-PERSON IOB tag, depending on whether the previous word was also in name_set. Any word that's not in name_set gets the O IOB tag. When complete, the list of IOB-tagged tuples is converted to a Tree using conlltags2tree(). Using it on the same tagged sentence as before, we get the following result:

>>> from chunkers import PersonChunker
>>> chunker = PersonChunker()
>>> sub_leaves(chunker.parse(treebank_chunk.tagged_sents()[0]), 'PERSON')
[[('Pierre', 'NNP')]]

We no longer get Nov., but we've also lost Vinken, as it is not found in the names corpus. This recipe highlights some of the difficulties of chunk extraction and natural language processing in general:

  • If you use general patterns, you'll get general results
  • If you're looking for specific results, you must use specific data
  • If your specific data is incomplete, your results will be incomplete too
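The IOB-to-tree conversion that PersonChunker relies on can also be seen in isolation. Here conlltags2tree() is called directly on hand-built (word, tag, IOB) triples:

```python
from nltk.chunk.util import conlltags2tree

iobs = [('Pierre', 'NNP', 'B-PERSON'),
        ('Vinken', 'NNP', 'I-PERSON'),
        ('said', 'VBD', 'O')]

# The B-/I- pair becomes a single PERSON subtree; O words stay plain leaves
tree = conlltags2tree(iobs)
```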

See also

The previous recipe defines the sub_leaves() function used to show the found chunks. In the next recipe, we'll cover how to find LOCATION chunks based on the gazetteers corpus.
