Part-of-speech tagging is the process of assigning a part-of-speech tag to each word. Most of the time, a tagger must first be trained on a training corpus. How to train and use a tagger is covered in detail in Chapter 4, Part-of-speech Tagging, but first we must know how to create and use a training corpus of part-of-speech tagged words.
The simplest format for a tagged corpus is of the form word/tag. The following is an excerpt from the brown corpus:
The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/jj ./.
Each word has a tag denoting its part-of-speech. For example, nn refers to a noun, while a tag that starts with vb is a verb.
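The word/tag format is simple enough to parse by hand. The following sketch is purely illustrative (it is not how NLTK implements its reader, and the function name parse_tagged_line is made up); it splits each token on its last separator, so that even the './.' token comes apart correctly:

```python
# Illustrative parser for a line of word/tag tokens (not NLTK's code).
line = "The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/jj ./."

def parse_tagged_line(line, sep="/"):
    """Split each whitespace-delimited token on its LAST sep, so tokens
    whose word part contains the separator still parse correctly."""
    tagged = []
    for token in line.split():
        word, _, tag = token.rpartition(sep)
        tagged.append((word, tag))
    return tagged

print(parse_tagged_line(line))
# [('The', 'at-tl'), ('expense', 'nn'), ..., ('.', '.')]
```

Using rpartition() rather than split() is the key design choice: it keeps the word intact if it happens to contain the separator character.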
If you were to put the previous excerpt into a file called brown.pos, you could then create a TaggedCorpusReader using the following code:
>>> from nltk.corpus.reader import TaggedCorpusReader
>>> reader = TaggedCorpusReader('.', r'.*.pos')
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> reader.tagged_words()
[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
>>> reader.tagged_sents()
[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]
>>> reader.paras()
[[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]]
>>> reader.tagged_paras()
[[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]]
This time, instead of naming the file explicitly, we use a regular expression, r'.*.pos', to match all files whose names end with .pos. We could have done the same thing as we did with the WordListCorpusReader class, passing ['brown.pos'] as the second argument, but this way you can see how to include multiple files in a corpus without naming each one explicitly.
The TaggedCorpusReader class provides a number of methods for extracting text from a corpus. First, you can get a list of all words, or a list of tagged tokens. A tagged token is simply a tuple of (word, tag). Next, you can get a list of every sentence and also every tagged sentence, where a sentence is itself a list of words or tagged tokens. Finally, you can get a list of paragraphs, where each paragraph is a list of sentences and each sentence is a list of words or tagged tokens. The following is an inheritance diagram listing all the major methods:
All the methods we just demonstrated depend on tokenizers to split the text. The TaggedCorpusReader class tries to have good defaults, but you can customize them by passing in your own tokenizers at initialization time.
The default word tokenizer is an instance of nltk.tokenize.WhitespaceTokenizer. If you want to use a different tokenizer, you can pass that in as word_tokenizer, as shown in the following code:
>>> from nltk.tokenize import SpaceTokenizer
>>> reader = TaggedCorpusReader('.', r'.*.pos', word_tokenizer=SpaceTokenizer())
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
The default sentence tokenizer is an instance of nltk.tokenize.RegexpTokenizer with '\n' to identify the gaps. It assumes that each sentence is on a line all by itself, and individual sentences do not have line breaks. To customize this, you can pass in your own tokenizer as sent_tokenizer, as shown in the following code:
>>> from nltk.tokenize import LineTokenizer
>>> reader = TaggedCorpusReader('.', r'.*.pos', sent_tokenizer=LineTokenizer())
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
Paragraphs are assumed to be split by blank lines. This is done with the para_block_reader function, which is nltk.corpus.reader.util.read_blankline_block. There are a number of other block reader functions in nltk.corpus.reader.util, whose purpose is to read blocks of text from a stream. Their usage will be covered in more detail later in the Creating a custom corpus view recipe, where we'll create a custom corpus reader.
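The idea behind blank-line block reading can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not NLTK's actual read_blankline_block code: paragraphs are runs of non-blank lines separated by one or more blank lines.

```python
# Illustrative sketch of blank-line block reading (not NLTK's code):
# a block is a run of non-blank lines; blank lines end the current block.
def blankline_blocks(text):
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:  # flush a trailing block with no blank line after it
        blocks.append("\n".join(current))
    return blocks

text = "First/at para/nn ./.\n\nSecond/at para/nn ./.\n"
print(blankline_blocks(text))
# ['First/at para/nn ./.', 'Second/at para/nn ./.']
```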
If you don't want to use '/' as the word/tag separator, you can pass an alternative string to TaggedCorpusReader for sep. The default is sep='/', but if you want to split words and tags with '|', such as 'word|tag', then you should pass in sep='|'.
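The separator change only affects how each token is split. A quick illustration with a hypothetical pipe-separated line (the same last-separator split applies regardless of the sep character):

```python
# Hypothetical pipe-separated tagged line; split each token on its last '|'.
line = "The|at-tl expense|nn"
# rpartition returns (head, sep, tail); [::2] keeps just (head, tail).
tagged = [tok.rpartition("|")[::2] for tok in line.split()]
print(tagged)
# [('The', 'at-tl'), ('expense', 'nn')]
```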
NLTK 3.0 provides a method for converting known tagsets to a universal tagset. A tagset is just a list of part-of-speech tags used by one or more corpora. The universal tagset is a simplified and condensed tagset composed of only 12 part-of-speech tags, as shown in the following table:
Universal tag | Description
---|---
VERB | All verbs
NOUN | Common and proper nouns
PRON | Pronouns
ADJ | Adjectives
ADV | Adverbs
ADP | Prepositions and postpositions
CONJ | Conjunctions
DET | Determiners
NUM | Cardinal numbers
PRT | Particles
X | Other
. | Punctuation
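Conceptually, a tagset mapping is just a dictionary from fine-grained tags to universal tags. The following fragment is a hand-picked illustrative subset (not the full en-brown mapping shipped with NLTK), applied to the tagged words from the earlier excerpt:

```python
# Tiny illustrative subset of a brown-to-universal mapping (not the
# complete mapping file from nltk_data).
BROWN_TO_UNIVERSAL = {
    "AT-TL": "DET", "NN": "NOUN", "CC": "CONJ",
    "VBN": "VERB", "BER": "VERB", "JJ": "ADJ", ".": ".",
}

tagged = [("The", "AT-TL"), ("expense", "NN"), ("and", "CC")]
# Unknown tags fall back to 'UNK', mirroring NLTK's behavior for
# unrecognized mappings.
universal = [(w, BROWN_TO_UNIVERSAL.get(t, "UNK")) for w, t in tagged]
print(universal)
# [('The', 'DET'), ('expense', 'NOUN'), ('and', 'CONJ')]
```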
Mappings from a known tagset to the universal tagset can be found at nltk_data/taggers/universal_tagset. For example, the treebank tag mappings are in nltk_data/taggers/universal_tagset/en-ptb.map.
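These mapping files are plain text. Assuming a two-column, tab-separated tag-to-universal-tag layout (an assumption about the file format, sketched here with an inline sample rather than the real en-ptb.map), loading one into a dictionary is straightforward:

```python
# Sketch of loading a tagset mapping file. The two-column, tab-separated
# layout is an assumption; an inline sample stands in for the real file.
sample = "NN\tNOUN\nVBD\tVERB\nJJ\tADJ\n"

mapping = dict(
    line.split("\t", 1)          # fine-grained tag -> universal tag
    for line in sample.splitlines()
    if line.strip()              # skip any blank lines
)
print(mapping["VBD"])
# VERB
```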
To map corpus tags to the universal tagset, the corpus reader must be initialized with a known tagset name. Then you pass tagset='universal' to a method such as tagged_words(), as shown in the following code:
>>> reader = TaggedCorpusReader('.', r'.*.pos', tagset='en-brown')
>>> reader.tagged_words(tagset='universal')
[('The', 'DET'), ('expense', 'NOUN'), ('and', 'CONJ'), ...]
Most NLTK tagged corpora are initialized with a known tagset, making conversion easy. The following is an example with the treebank corpus:
>>> from nltk.corpus import treebank
>>> treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
>>> treebank.tagged_words(tagset='universal')
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]
If you try to map using an unknown mapping or tagset, every word will be tagged with UNK:
>>> treebank.tagged_words(tagset='brown')
[('Pierre', 'UNK'), ('Vinken', 'UNK'), (',', 'UNK'), ...]
Chapter 4, Part-of-speech Tagging, will cover part-of-speech tags and tagging in much more detail. And for more on tokenizers, see the first three recipes of Chapter 1, Tokenizing Text and WordNet Basics.
In the next recipe, we'll create a chunked phrase corpus, where each phrase is also part-of-speech tagged.