Chapter 4. Part-of-speech Tagging

In this chapter, we will cover the following recipes:

  • Default tagging
  • Training a unigram part-of-speech tagger
  • Combining taggers with backoff tagging
  • Training and combining ngram taggers
  • Creating a model of likely word tags
  • Tagging with regular expressions
  • Affix tagging
  • Training a Brill tagger
  • Training the TnT tagger
  • Using WordNet for tagging
  • Tagging proper names
  • Classifier-based tagging
  • Training a tagger with NLTK-Trainer

Introduction

Part-of-speech tagging is the process of converting a sentence, in the form of a list of words, into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech tag, and signifies whether the word is a noun, adjective, verb, and so on.
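To make the input and output shapes concrete, here is a minimal sketch in plain Python. The tags are assigned by hand and follow Penn Treebank-style names (an assumption for illustration); no tagger is involved yet.

```python
# Input: a sentence that has already been split into words.
words = ["The", "duck", "swims"]

# Hand-assigned tags, for illustration only (Penn Treebank-style names assumed):
# DT = determiner, NN = singular noun, VBZ = third-person singular verb.
tags = ["DT", "NN", "VBZ"]

# Output shape of part-of-speech tagging: a list of (word, tag) tuples.
tagged_sentence = list(zip(words, tags))

for word, tag in tagged_sentence:
    print(word, tag)
```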

Part-of-speech tagging is a necessary step before chunking, which is covered in Chapter 5, Extracting Chunks. Without the part-of-speech tags, a chunker cannot know how to extract phrases from a sentence. But with part-of-speech tags, you can tell a chunker how to identify phrases based on tag patterns.

You can also use part-of-speech tags for grammar analysis and word sense disambiguation. For example, the word duck could refer to a bird, or it could be a verb indicating a downward motion. A program cannot tell these two senses apart without additional information, such as part-of-speech tags. For more on word sense disambiguation, see https://en.wikipedia.org/wiki/Word_sense_disambiguation.

Most of the taggers we'll cover are trainable. They use a list of tagged sentences as their training data, such as the list returned by the tagged_sents() method of a TaggedCorpusReader class (see the Creating a part-of-speech tagged word corpus recipe in Chapter 3, Creating Custom Corpora, for more details). From these training sentences, the tagger builds an internal model that tells it how to tag a word. Other taggers use external data sources or match word patterns to choose a tag for a word.
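As a rough illustration of what "training" means here, the sketch below builds a simple internal model from a list of tagged sentences: for each word, it remembers the tag seen most often. This is plain Python, not NLTK's implementation; the function name train_most_frequent_tag_model and the tiny training set are hypothetical.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag_model(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    # Keep only the single most common tag for each word.
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


# Hypothetical training data in the same shape as tagged_sents() output:
# a list of sentences, each a list of (word, tag) tuples.
train_sents = [
    [("the", "DT"), ("duck", "NN"), ("swims", "VBZ")],
    [("they", "PRP"), ("duck", "VBP"), ("quickly", "RB")],
    [("a", "DT"), ("duck", "NN"), ("quacks", "VBZ")],
]

model = train_most_frequent_tag_model(train_sents)
print(model["duck"])     # "duck" is tagged NN twice and VBP once
print(model.get("goose"))  # unseen words have no entry in the model
```

A model like this has nothing to say about words it never saw during training, which is one reason taggers are often combined.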

All taggers in NLTK are in the nltk.tag package and inherit from the TaggerI base class. TaggerI requires all subclasses to implement a tag() method, which takes a list of words as input and returns a list of tagged words as output. TaggerI also provides an evaluate() method for evaluating the accuracy of the tagger (covered at the end of the Default tagging recipe). Many taggers can also be combined into a backoff chain, so that if one tagger cannot tag a word, the next tagger is used, and so on.
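The interface described above can be sketched without NLTK. The classes below are hypothetical stand-ins (ConstantTagger and LookupTagger are illustrative names, not NLTK classes), and the backoff logic is a simplified assumption about how a chain behaves: each tagger exposes tag(), an evaluate() method reports accuracy against gold-standard sentences, and a tagger that cannot tag a word hands it to the next tagger in the chain.

```python
class ConstantTagger:
    """Hypothetical last-resort tagger: assigns the same tag to every word."""

    def __init__(self, tag):
        self.default_tag = tag

    def tag_word(self, word):
        return self.default_tag

    def tag(self, words):
        return [(word, self.tag_word(word)) for word in words]


class LookupTagger:
    """Hypothetical tagger: looks each word up in a dict, else asks the backoff."""

    def __init__(self, model, backoff=None):
        self.model = model
        self.backoff = backoff

    def tag_word(self, word):
        tag = self.model.get(word)
        if tag is None and self.backoff is not None:
            # This tagger cannot tag the word, so the next one in the chain tries.
            return self.backoff.tag_word(word)
        return tag

    def tag(self, words):
        return [(word, self.tag_word(word)) for word in words]

    def evaluate(self, gold_sents):
        """Return the fraction of words whose predicted tag matches the gold tag."""
        correct = total = 0
        for sent in gold_sents:
            words = [word for word, _ in sent]
            for (_, predicted), (_, gold) in zip(self.tag(words), sent):
                correct += int(predicted == gold)
                total += 1
        return correct / total if total else 0.0


default = ConstantTagger("NN")
tagger = LookupTagger({"the": "DT", "swims": "VBZ"}, backoff=default)

tagged = tagger.tag(["the", "duck", "swims"])

gold = [
    [("the", "DT"), ("duck", "NN"), ("swims", "VBZ")],
    [("they", "PRP"), ("duck", "VBP")],
]
accuracy = tagger.evaluate(gold)
print(tagged)
print(accuracy)
```

Here "duck" is unknown to the lookup tagger, so it falls through to the constant tagger and is tagged NN. That is correct in the first gold sentence and wrong in the second, which is why the accuracy is below 1.0.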
