Creating a part-of-speech tagged word corpus

Part-of-speech tagging is the process of identifying the part-of-speech tag for each word in a text. Most of the time, a tagger must first be trained on a training corpus. How to train and use a tagger is covered in detail in Chapter 4, Part-of-speech Tagging, but first we must know how to create and use a training corpus of part-of-speech tagged words.

Getting ready

The simplest format for a tagged corpus is of the form word/tag. The following is an excerpt from the brown corpus:

The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/jj ./.

Each word has a tag denoting its part-of-speech. For example, nn refers to a noun, while a tag that starts with vb is a verb.
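The word/tag format is easy to parse by hand, but NLTK also provides the str2tuple() helper, which splits a word/tag string into a (word, tag) tuple. As a minimal illustration, note that str2tuple() uppercases the tag, which is why the reader output shown later displays NN where the file contains nn:

>>> from nltk.tag import str2tuple
>>> str2tuple('expense/nn')
('expense', 'NN')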

Note

Different corpora can use different tags to mean the same thing. For example, the treebank corpus uses different tags as compared to the brown corpus, even though both are English text. But both sets of tags can be converted into a universal tagset, described at the end of this recipe.

How to do it...

If you were to put the previous excerpt into a file called brown.pos, you could then create a TaggedCorpusReader class using the following code:

>>> from nltk.corpus.reader import TaggedCorpusReader
>>> reader = TaggedCorpusReader('.', r'.*\.pos')
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> reader.tagged_words()
[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]
>>> reader.tagged_sents()
[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]
>>> reader.paras()
[[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]]
>>> reader.tagged_paras()
[[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.', '.')]]]

How it works...

This time, instead of naming the file explicitly, we use a regular expression, r'.*\.pos', to match all files whose names end with .pos. We could have done the same thing as we did with the WordListCorpusReader class and passed ['brown.pos'] as the second argument, but this way you can see how to include multiple files in a corpus without naming each one explicitly.
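If you do want to name the files explicitly, here's what the equivalent call looks like (a minimal sketch, assuming brown.pos is the only file of interest in the current directory):

>>> reader = TaggedCorpusReader('.', ['brown.pos'])
>>> reader.fileids()
['brown.pos']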

The TaggedCorpusReader class provides a number of methods for extracting text from a corpus. First, you can get a list of all words or a list of tagged tokens. A tagged token is simply a tuple of (word, tag). Next, you can get a list of every sentence and also every tagged sentence where the sentence is itself a list of words or tagged tokens. Finally, you can get a list of paragraphs, where each paragraph is a list of sentences and each sentence is a list of words or tagged tokens. The following is an inheritance diagram listing all the major methods:

[Inheritance diagram: TaggedCorpusReader and its major methods]
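These methods return list-like views you can iterate over like any other Python sequence. As a minimal sketch (assuming the brown.pos file from earlier is in the current directory), here's how you might count tag frequencies from tagged_words():

>>> from collections import Counter
>>> reader = TaggedCorpusReader('.', r'.*\.pos')
>>> Counter(tag for word, tag in reader.tagged_words())
Counter({'NN': 2, 'AT-TL': 1, 'CC': 1, 'VBN': 1, 'BER': 1, 'JJ': 1, '.': 1})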

There's more...

All the functions we just demonstrated depend on tokenizers to split the text. The TaggedCorpusReader class tries to have good defaults, but you can customize them by passing in your own tokenizers at the time of initialization.

Customizing the word tokenizer

The default word tokenizer is an instance of nltk.tokenize.WhitespaceTokenizer. If you want to use a different tokenizer, you can pass that in as word_tokenizer, as shown in the following code:

>>> from nltk.tokenize import SpaceTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', word_tokenizer=SpaceTokenizer())
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]

Customizing the sentence tokenizer

The default sentence tokenizer is an instance of nltk.tokenize.RegexpTokenizer with '\n' to identify the gaps. It assumes that each sentence is on a line all by itself, and individual sentences do not have line breaks. To customize this, you can pass in your own tokenizer as sent_tokenizer, as shown in the following code:

>>> from nltk.tokenize import LineTokenizer
>>> reader = TaggedCorpusReader('.', r'.*\.pos', sent_tokenizer=LineTokenizer())
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]

Customizing the paragraph block reader

Paragraphs are assumed to be split by blank lines. This is done with the para_block_reader parameter, which defaults to nltk.corpus.reader.util.read_blankline_block. There are a number of other block reader functions in nltk.corpus.reader.util, whose purpose is to read blocks of text from a stream. Their usage will be covered in more detail later in the Creating a custom corpus view recipe, where we'll create a custom corpus reader.
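As a minimal sketch, you could pass the default explicitly; any function with the same signature that reads blocks of text from a stream could be substituted for read_blankline_block:

>>> from nltk.corpus.reader.util import read_blankline_block
>>> reader = TaggedCorpusReader('.', r'.*\.pos', para_block_reader=read_blankline_block)
>>> reader.paras()
[[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical', '.']]]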

Customizing the tag separator

If you don't want to use '/' as the word/tag separator, you can pass an alternative string to TaggedCorpusReader for sep. The default is sep='/', but if you want to split words and tags with '|', such as 'word|tag', then you should pass in sep='|'.
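For example, given a hypothetical file brown.pipe containing 'The|at-tl expense|nn ...' in the same format, you could read it like this:

>>> reader = TaggedCorpusReader('.', r'.*\.pipe', sep='|')
>>> reader.tagged_words()
[('The', 'AT-TL'), ('expense', 'NN'), ...]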

Converting tags to a universal tagset

NLTK 3.0 provides a method for converting known tagsets to a universal tagset. A tagset is just a list of part-of-speech tags used by one or more corpora. The universal tagset is a simplified and condensed tagset composed of only 12 part-of-speech tags, as shown in the following table:

Universal tag    Description
VERB             All verbs
NOUN             Common and proper nouns
PRON             Pronouns
ADJ              Adjectives
ADV              Adverbs
ADP              Prepositions and postpositions
CONJ             Conjunctions
DET              Determiners
NUM              Cardinal numbers
PRT              Particles
X                Other
.                Punctuation

Mappings from a known tagset to the universal tagset can be found at nltk_data/taggers/universal_tagset. For example, treebank tag mappings are in nltk_data/taggers/universal_tagset/en-ptb.map.
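If you want to inspect a mapping file yourself, nltk.data.find() will locate it for you. This sketch assumes the universal_tagset data package has been downloaded with nltk.download(); the path in the output is illustrative and will vary by system:

>>> from nltk.data import find
>>> find('taggers/universal_tagset/en-ptb.map')
FileSystemPathPointer('/home/user/nltk_data/taggers/universal_tagset/en-ptb.map')

Each line of the file is a tab-separated pair mapping a source tag to its universal equivalent.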

To map corpus tags to the universal tagset, the corpus reader must be initialized with a known tagset name. Then you pass in tagset='universal' to a method like tagged_words(), as shown in the following code:

>>> reader = TaggedCorpusReader('.', r'.*\.pos', tagset='en-brown')
>>> reader.tagged_words(tagset='universal')
[('The', 'DET'), ('expense', 'NOUN'), ('and', 'CONJ'), ...]

Most NLTK tagged corpora are initialized with a known tagset, making conversion easy. The following is an example with the treebank corpus:

>>> from nltk.corpus import treebank
>>> treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
>>> treebank.tagged_words(tagset='universal')
[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]

If you try to map using an unknown mapping or tagset, every word will be tagged with UNK:

>>> treebank.tagged_words(tagset='brown')
[('Pierre', 'UNK'), ('Vinken', 'UNK'), (',', 'UNK'), ...]

See also

Chapter 4, Part-of-speech Tagging, will cover part-of-speech tags and tagging in much more detail. And for more on tokenizers, see the first three recipes of Chapter 1, Tokenizing Text and WordNet Basics.

In the next recipe, we'll create a chunked phrase corpus, where each phrase is also part-of-speech tagged.
