In this chapter, we will cover the following recipes:
Part-of-speech tagging is the process of converting a sentence, in the form of a list of words, into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech tag, and signifies whether the word is a noun, adjective, verb, and so on.
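To make the input and output shapes concrete, here is a minimal sketch in plain Python. The TOY_LEXICON dictionary and the toy_tag() function are hypothetical names invented for this illustration; they are not part of NLTK, and a real tagger chooses tags in a far more sophisticated way.

```python
# Hypothetical word -> tag lookup, for illustration only.
TOY_LEXICON = {"the": "DT", "quick": "JJ", "fox": "NN", "jumps": "VBZ"}

def toy_tag(words):
    """Pair each word with a tag; words missing from the lexicon get None."""
    return [(word, TOY_LEXICON.get(word.lower())) for word in words]

sentence = ["The", "quick", "fox", "jumps"]
tagged = toy_tag(sentence)
print(tagged)
# [('The', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ')]
```

The input is a list of words and the output is a list of (word, tag) tuples of the same length, which is the shape that the taggers in this chapter produce.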
Part-of-speech tagging is a necessary step before chunking, which is covered in Chapter 5, Extracting Chunks. Without the part-of-speech tags, a chunker cannot know how to extract phrases from a sentence. But with part-of-speech tags, you can tell a chunker how to identify phrases based on tag patterns.
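As a rough illustration of what a tag pattern means, the following sketch scans a tagged sentence for an optional determiner (DT), any number of adjectives (JJ), and one or more nouns (NN). The extract_noun_phrases() function is a hypothetical helper written for this example; it only hints at the idea and is not how NLTK's chunkers are implemented.

```python
def extract_noun_phrases(tagged):
    """Greedy scan for the tag pattern: optional DT, any JJ, one or more NN."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":          # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":   # zero or more adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NN":   # one or more nouns
            k += 1
        if k > j:                          # at least one noun was matched
            phrases.append([word for word, _ in tagged[i:k]])
            i = k
        else:
            i += 1
    return phrases

tagged_sent = [("the", "DT"), ("little", "JJ"), ("duck", "NN"),
               ("saw", "VBD"), ("a", "DT"), ("pond", "NN")]
print(extract_noun_phrases(tagged_sent))
# [['the', 'little', 'duck'], ['a', 'pond']]
```

Without the tags, the scan would have nothing to match against, which is why tagging has to come before chunking.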
You can also use part-of-speech tags for grammar analysis and word sense disambiguation. For example, the word duck could refer to a bird, or it could be a verb indicating a downward motion. Computers cannot know the difference without additional information, such as part-of-speech tags. For more on word sense disambiguation, refer to the URL https://en.wikipedia.org/wiki/Word_sense_disambiguation.
Most of the taggers we'll cover are trainable. They use a list of tagged sentences as their training data, such as what you get from the tagged_sents() method of a TaggedCorpusReader class (see the Creating a part-of-speech tagged word corpus recipe in Chapter 3, Creating Custom Corpora, for more details). With these training sentences, the tagger generates an internal model that will tell it how to tag a word. Other taggers use external data sources or match word patterns to choose a tag for a word.
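One simple kind of internal model is a table that maps each word to the tag it carried most often in the training sentences. The sketch below builds such a table in plain Python. The train_most_frequent_tag() function and the tiny hand-written training set are assumptions made for this illustration; the training data has the same shape as a list of tagged sentences, but it does not come from a real corpus, and NLTK's trainable taggers are more general than this.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sents):
    """Build a word -> most frequent tag table from tagged sentences."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

# Hand-written training data: a list of sentences,
# each sentence a list of (word, tag) tuples.
train_sents = [
    [("the", "DT"), ("duck", "NN"), ("swims", "VBZ")],
    [("a", "DT"), ("duck", "NN"), ("quacks", "VBZ")],
    [("they", "PRP"), ("duck", "VBP")],
]

model = train_most_frequent_tag(train_sents)
print(model["duck"])  # NN (seen twice as NN, once as VBP)
```

A model like this has no entry for words it never saw during training, which is one reason taggers are often chained together.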
All taggers in NLTK are in the nltk.tag package and inherit from the TaggerI base class. TaggerI requires all subclasses to implement a tag() method, which takes a list of words as input and returns a list of tagged words as output. TaggerI also provides an evaluate() method for evaluating the accuracy of the tagger (covered at the end of the Default tagging recipe). Many taggers can also be combined into a backoff chain, so that if one tagger cannot tag a word, the next tagger is used, and so on.
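The interplay between a tag() method, an accuracy check, and a backoff chain can be sketched without NLTK. The SketchTagger, LookupTagger, and ConstantTagger classes below are hypothetical names invented for this example; they mimic the general idea only and should not be read as NLTK's actual class hierarchy or method signatures.

```python
class SketchTagger:
    """Hypothetical base class: subclasses pick a tag for one word, or None."""

    def __init__(self, backoff=None):
        self.backoff = backoff

    def choose_tag(self, word):
        raise NotImplementedError

    def tag_one(self, word):
        tag = self.choose_tag(word)
        # If this tagger cannot decide, defer to the next tagger in the chain.
        if tag is None and self.backoff is not None:
            return self.backoff.tag_one(word)
        return tag

    def tag(self, words):
        """Take a list of words, return a list of (word, tag) tuples."""
        return [(word, self.tag_one(word)) for word in words]

    def accuracy(self, gold_sents):
        """Fraction of words whose predicted tag matches the gold tag."""
        correct = total = 0
        for sent in gold_sents:
            words = [word for word, _ in sent]
            for (_, predicted), (_, gold) in zip(self.tag(words), sent):
                correct += predicted == gold
                total += 1
        return correct / total if total else 0.0


class LookupTagger(SketchTagger):
    """Tags only the words found in a fixed table."""

    def __init__(self, table, backoff=None):
        super().__init__(backoff)
        self.table = table

    def choose_tag(self, word):
        return self.table.get(word)


class ConstantTagger(SketchTagger):
    """Assigns the same tag to every word; useful as the end of a chain."""

    def __init__(self, default_tag):
        super().__init__()
        self.default_tag = default_tag

    def choose_tag(self, word):
        return self.default_tag


tagger = LookupTagger({"the": "DT", "swims": "VBZ"}, backoff=ConstantTagger("NN"))
print(tagger.tag(["the", "duck", "swims"]))
# [('the', 'DT'), ('duck', 'NN'), ('swims', 'VBZ')]

gold = [[("the", "DT"), ("duck", "NN"), ("swims", "VBZ"), ("quickly", "RB")]]
print(tagger.accuracy(gold))  # 0.75
```

Here the lookup tagger has no entry for duck, so the constant tagger at the end of the chain supplies NN. The word quickly also falls through to NN, which is wrong, and the accuracy score reflects that one miss out of four words.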