Creating a part-of-speech tagged word corpus

Part-of-speech tagging is the process of identifying the part-of-speech tag for each word in a text. Most of the time, a tagger must first be trained on a training corpus. How to train and use a tagger is covered in detail in Chapter 4, Part-of-speech Tagging, but first we must know how to create and use a training corpus of part-of-speech tagged words.

Getting ready

The simplest format for a tagged corpus is of the form word/tag. The following is an excerpt from the brown corpus:

The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/jj ./.

Each word has a tag denoting its part-of-speech. For example, nn refers to a noun, while a tag that starts with vb is a verb.
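The word/tag format is easy to parse by hand, but NLTK also provides the str2tuple() helper, which splits a word/tag string into a (word, tag) tuple. As a minimal illustration, note that str2tuple() uppercases the tag, which is why the reader output shown later displays NN where the file contains nn:

>>> from nltk.tag import str2tuple
>>> str2tuple('expense/nn')
('expense', 'NN')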

Note

Different corpora can use different tags to mean the same thing. For example, the treebank corpus uses different tags as compared to the brown corpus, even though both are English text. But both sets of tags can be converted into a universal tagset, described at the end of this recipe.

How to do it...

If you were to put the previous excerpt into a file called brown.pos, you could then create a TaggedCorpusReader class using the following code:

>>> from nltk.corpus.reader import TaggedCorpusReader
>>> reader = TaggedCorpusReader('.', r'.*\.pos')
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> reader.tagged_words()
[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
>>> reader.tagged_sents()
[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]
>>> reader.paras()
[[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]]
>>> reader.tagged_paras()
[[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]]

How it works...

This time, instead of naming the file explicitly, we use a regular expression, r'.*\.pos', to match all files whose names end with .pos. We could have done the same thing as we did with the WordListCorpusReader class and passed ['brown.pos'] as the second argument, but this way you can see how to include multiple files in a corpus without naming each one explicitly.
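If you do want to name the files explicitly, here's what the equivalent call looks like (a minimal sketch, assuming brown.pos is the only file of interest in the current directory):

>>> reader = TaggedCorpusReader('.', ['brown.pos'])
>>> reader.fileids()
['brown.pos']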

The TaggedCorpusReader class provides a number of methods for extracting text from a corpus. First, you can get a list of all words or a list of tagged tokens. A tagged token is simply a tuple of (word, tag). Next, you can get a list of every sentence and also every tagged sentence where the sentence is itself a list of words or tagged tokens. Finally, you can get a list of paragraphs, where each paragraph is a list of sentences and each sentence is a list of words or tagged tokens. The following is an inheritance diagram listing all the major methods:

[Inheritance diagram: TaggedCorpusReader and its major methods]
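These methods return list-like views you can iterate over like any other Python sequence. As a minimal sketch (assuming the brown.pos file from earlier is in the current directory), here's how you might count tag frequencies from tagged_words():

>>> from collections import Counter
>>> reader = TaggedCorpusReader('.', r'.*\.pos')
>>> Counter(tag for word, tag in reader.tagged_words())
Counter({'NN': 2, 'AT-TL': 1, 'CC': 1, 'VBN': 1, 'BER': 1, 'JJ': 1, '.': 1})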

There's more...

All the functions we just demonstrated depend on tokenizers to split the text. The TaggedCorpusReader class tries to have good defaults, but you can customize them by passing in your own tokenizers at the time of initialization.

Customizing the word tokenizer

The default word tokenizer is an instance of nltk.tokenize.WhitespaceTokenizer. If you want to use a different tokenizer, you can pass that in as word_tokenizer, as shown in the following code:

>>> from nltk.tokenize import SpaceTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', word_tokenizer=SpaceTokenizer())
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]

Customizing the sentence tokenizer

The default sentence tokenizer is an instance of nltk.tokenize.RegexpTokenizer with '\n' to identify the gaps. It assumes that each sentence is on a line all by itself, and individual sentences do not have line breaks. To customize this, you can pass in your own tokenizer as sent_tokenizer, as shown in the following code:

>>> from nltk.tokenize import LineTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', sent_tokenizer=LineTokenizer())
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]

Customizing the paragraph block reader

Paragraphs are assumed to be split by blank lines. This is done with the para_block_reader parameter, which defaults to nltk.corpus.reader.util.read_blankline_block. There are a number of other block reader functions in nltk.corpus.reader.util, whose purpose is to read blocks of text from a stream. Their usage will be covered in more detail later in the Creating a custom corpus view recipe, where we'll create a custom corpus reader.
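As a minimal sketch, you could pass the default explicitly; any function with the same signature that reads blocks of text from a stream could be substituted for read_blankline_block:

>>> from nltk.corpus.reader.util import read_blankline_block
>>> reader = TaggedCorpusReader('.', r'.*\.pos', para_block_reader=read_blankline_block)
>>> reader.paras()
[[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]]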

Customizing the tag separator

If you don't want to use '/' as the word/tag separator, you can pass an alternative string to TaggedCorpusReader for sep. The default is sep='/', but if you want to split words and tags with '|', such as 'word|tag', then you should pass in sep='|'.
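For example, given a hypothetical file brown.pipe containing 'The|at-tl expense|nn ...' in the same format, you could read it like this:

>>> reader = TaggedCorpusReader('.', r'.*\.pipe', sep='|')
>>> reader.tagged_words()
[('The', 'AT-TL'), ('expense', 'NN'), ...]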

Converting tags to a universal tagset

NLTK 3.0 provides a method for converting known tagsets to a universal tagset. A tagset is just a list of part-of-speech tags used by one or more corpora. The universal tagset is a simplified and condensed tagset composed of only 12 part-of-speech tags, as shown in the following table:

Universal tag    Description
VERB             All verbs
NOUN             Common and proper nouns
PRON             Pronouns
ADJ              Adjectives
ADV              Adverbs
ADP              Prepositions and postpositions
CONJ             Conjunctions
DET              Determiners
NUM              Cardinal numbers
PRT              Particles
X                Other
.                Punctuation

Mappings from a known tagset to the universal tagset can be found at nltk_data/taggers/universal_tagset. For example, treebank tag mappings are in nltk_data/taggers/universal_tagset/en-ptb.map.
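If you want to inspect a mapping file yourself, nltk.data.find() will locate it for you. This sketch assumes the universal_tagset data package has been downloaded with nltk.download(); the path in the output is illustrative and will vary by system:

>>> from nltk.data import find
>>> find('taggers/universal_tagset/en-ptb.map')
FileSystemPathPointer('/home/user/nltk_data/taggers/universal_tagset/en-ptb.map')

Each line of the file is a tab-separated pair mapping a source tag to its universal equivalent.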

To map corpus tags to the universal tagset, the corpus reader must be initialized with a known tagset name. Then you pass in tagset='universal' to a method like tagged_words(), as shown in the following code:

>>> reader = TaggedCorpusReader('.', r'.*\.pos', tagset='en-brown')
>>> reader.tagged_words(tagset='universal')
[('The', 'DET'), ('expense', 'NOUN'), ('and', 'CONJ'), ...]

Most NLTK tagged corpora are initialized with a known tagset, making conversion easy. The following is an example with the treebank corpus:

>>> from nltk.corpus import treebank
>>> treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
>>> treebank.tagged_words(tagset='universal')
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]

If you try to map using an unknown mapping or tagset, every word will be tagged with UNK:

>>> treebank.tagged_words(tagset='brown')
[('Pierre', 'UNK'), ('Vinken', 'UNK'), (',', 'UNK'), ...]

See also

Chapter 4, Part-of-speech Tagging, will cover part-of-speech tags and tagging in much more detail. And for more on tokenizers, see the first three recipes of Chapter 1, Tokenizing Text and WordNet Basics.

In the next recipe, we'll create a chunked phrase corpus, where each phrase is also part-of-speech tagged.
