Chapter 5. Extracting Chunks

In this chapter, we will cover the following recipes:

  • Chunking and chinking with regular expressions
  • Merging and splitting chunks with regular expressions
  • Expanding and removing chunks with regular expressions
  • Partial parsing with regular expressions
  • Training a tagger-based chunker
  • Classification-based chunking
  • Extracting named entities
  • Extracting proper noun chunks
  • Extracting location chunks
  • Training a named entity chunker
  • Training a chunker with NLTK-Trainer

Introduction

Chunk extraction, or partial parsing, is the process of extracting short phrases from a part-of-speech tagged sentence. This is different from full parsing in that we're interested in standalone chunks, or phrases, instead of full parse trees (for more on parse trees, see https://en.wikipedia.org/wiki/Parse_tree). The idea is that meaningful phrases can be extracted from a sentence by looking for particular patterns of part-of-speech tags.
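To make the idea of matching patterns of part-of-speech tags concrete, here is a minimal sketch that uses only Python's standard re module. It is not NLTK's chunker. The function name extract_chunks, the angle-bracket tag encoding, and the sample sentence are all assumptions made for illustration. The default pattern looks for an optional determiner, any number of adjectives, and one or more nouns.

```python
import re

def extract_chunks(tagged, pattern=r"(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+"):
    """Return the word lists whose tag sequence matches `pattern`.

    `tagged` is a list of (word, tag) pairs. The default pattern is an
    illustrative noun-phrase pattern, not a complete grammar.
    """
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN><VBD>".
    tags = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in re.finditer(pattern, tags):
        # Map character offsets back to token indices by counting "<".
        start = tags.count("<", 0, match.start())
        end = start + match.group().count("<")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

# Hypothetical, hand-tagged sentence using Penn Treebank tags.
sentence = [
    ("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumped", "VBD"),
    ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]
noun_phrases = extract_chunks(sentence)
print(noun_phrases)
```

Only the tag sequence decides what gets extracted; the words themselves are never inspected. That is why the approach depends on the sentence having been part-of-speech tagged first.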

As in Chapter 4, Part-of-speech Tagging, we'll be using the Penn Treebank corpus for basic training and testing of chunk extraction. We'll also be using the CoNLL2000 corpus, whose simpler and more flexible format supports multiple chunk types (for more details on the CoNLL2000 corpus and IOB tags, see the Creating a chunked phrase corpus recipe in Chapter 3, Creating Custom Corpora).
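As a rough illustration of how IOB tags can carry multiple chunk types, the following sketch groups lines of the form "word POS-tag chunk-tag" into typed chunks using plain Python. The helper name iob_to_chunks and the sample sentence are hypothetical, and the handling of a stray I- tag (treating it as the start of a new chunk) is an assumption of this sketch rather than something the corpus format dictates.

```python
def iob_to_chunks(lines):
    """Group 'word pos iob' lines into (chunk_type, [words]) pairs."""
    chunks = []
    current = None  # the chunk being built, or None when outside a chunk
    for line in lines:
        word, pos, iob = line.split()
        if iob == "O":
            current = None  # outside any chunk
            continue
        prefix, chunk_type = iob.split("-", 1)
        if prefix == "I" and current is not None and current[0] == chunk_type:
            current[1].append(word)  # continue the open chunk
        else:
            # "B-" begins a chunk; a stray "I-" is also treated as a start
            # (an assumption made for this sketch).
            current = (chunk_type, [word])
            chunks.append(current)
    return chunks

# Hypothetical sentence written in a word / POS tag / IOB tag layout.
sample = [
    "the DT B-NP",
    "dog NN I-NP",
    "barked VBD B-VP",
    "at IN B-PP",
    "the DT B-NP",
    "cat NN I-NP",
    ". . O",
]
chunks = iob_to_chunks(sample)
for chunk_type, words in chunks:
    print(chunk_type, " ".join(words))
```

Because the chunk type travels with every tag (NP, VP, PP, and so on), one pass over the tokens recovers every kind of chunk at once.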
