If you have a large corpus of text, you might want to categorize it into separate sections. This can be helpful for organization, or for text classification, which is covered in Chapter 7, Text Classification. The brown corpus, for example, has a number of different categories, as shown in the following code:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
In this recipe, we'll learn how to create our own categorized text corpus.
The easiest way to categorize a corpus is to have one file for each category. The following are two excerpts from the movie_reviews corpus, saved here as movie_pos.txt and movie_neg.txt, respectively:
the thin red line is flawed but it provokes .
a big-budget and glossy production can not make up for a lack of spontaneity that permeates their tv show .
With these two files, we'll have two categories: pos and neg.
We'll use the CategorizedPlaintextCorpusReader class, which inherits from both PlaintextCorpusReader and CategorizedCorpusReader. Between them, these two superclasses require three arguments: the root directory, the fileids, and a category specification:
>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> reader = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')
>>> reader.categories()
['neg', 'pos']
>>> reader.fileids(categories=['neg'])
['movie_neg.txt']
>>> reader.fileids(categories=['pos'])
['movie_pos.txt']
The first two arguments to CategorizedPlaintextCorpusReader are the root directory and fileids, which are passed on to the PlaintextCorpusReader class to read in the files. The cat_pattern keyword argument is a regular expression for extracting the category names from the fileids. In our case, the category is the part of the fileid after movie_ and before .txt. The category must be surrounded by grouping parentheses.
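To see what such a pattern captures, the following minimal sketch applies a pattern of the kind passed as cat_pattern using Python's re module. It assumes the category is whatever the first capturing group matches against each fileid; the category_of helper is hypothetical and is not part of NLTK.

```python
import re

# A pattern of the kind passed as cat_pattern; the parentheses capture the category.
CAT_PATTERN = r'movie_(\w+)\.txt'


def category_of(fileid, pattern=CAT_PATTERN):
    """Hypothetical helper: return the first captured group, or None if no match."""
    match = re.match(pattern, fileid)
    return match.group(1) if match else None


for fileid in ['movie_pos.txt', 'movie_neg.txt', 'README']:
    # 'README' does not match the pattern, so it yields None.
    print(fileid, '->', category_of(fileid))
```

Note that without the grouping parentheses there would be no group 1 to extract, which is why the pattern must include them.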
The cat_pattern keyword is passed to CategorizedCorpusReader, which overrides the common corpus reader methods such as fileids(), words(), sents(), and paras() to accept a categories keyword argument. This way, you can get all the pos sentences by calling reader.sents(categories=['pos']). The CategorizedCorpusReader class also provides the categories() method, which returns a list of all the known categories in the corpus.
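The idea of layering a categories keyword on top of an existing reader can be sketched with plain Python classes. This is a toy illustration under stated assumptions, not NLTK's implementation: ToyPlainReader, ToyCategoryMixin, and ToyCategorizedReader are hypothetical names, the "corpus" is an in-memory dictionary, and words are split on whitespace only.

```python
class ToyPlainReader:
    """Hypothetical stand-in for a plaintext reader over in-memory 'files'."""

    def __init__(self, files):
        self._files = files  # dict mapping fileid -> text

    def fileids(self):
        return sorted(self._files)

    def words(self, fileids=None):
        fileids = self.fileids() if fileids is None else fileids
        return [w for f in fileids for w in self._files[f].split()]


class ToyCategoryMixin:
    """Hypothetical mixin that knows which fileids belong to which categories."""

    def _init_categories(self, cat_map):
        self._cat_map = cat_map  # dict mapping fileid -> list of categories

    def categories(self):
        return sorted({c for cats in self._cat_map.values() for c in cats})

    def _resolve(self, categories):
        # Turn a list of categories into the fileids that carry any of them.
        wanted = set(categories)
        return sorted(f for f, cats in self._cat_map.items() if wanted & set(cats))


class ToyCategorizedReader(ToyCategoryMixin, ToyPlainReader):
    """Joins both superclasses and adds a categories keyword to reader methods."""

    def __init__(self, files, cat_map):
        ToyPlainReader.__init__(self, files)
        self._init_categories(cat_map)

    def fileids(self, categories=None):
        if categories is None:
            return ToyPlainReader.fileids(self)
        return self._resolve(categories)

    def words(self, fileids=None, categories=None):
        if categories is not None:
            fileids = self._resolve(categories)
        return ToyPlainReader.words(self, fileids)


files = {'movie_pos.txt': 'flawed but it provokes .',
         'movie_neg.txt': 'a lack of spontaneity .'}
cat_map = {'movie_pos.txt': ['pos'], 'movie_neg.txt': ['neg']}
toy = ToyCategorizedReader(files, cat_map)
print(toy.categories())
print(toy.fileids(categories=['neg']))
print(toy.words(categories=['pos']))
```

The overriding methods only translate categories into fileids and then delegate to the plain reader, which keeps the file-reading logic in one place.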
The CategorizedPlaintextCorpusReader class is an example of using multiple inheritance to join methods from multiple superclasses, as shown in the following diagram:
Instead of cat_pattern, you could pass in a cat_map, which is a dictionary mapping a fileid to a list of category labels, as shown in the following code:
>>> reader = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt', cat_map={'movie_pos.txt': ['pos'], 'movie_neg.txt': ['neg']})
>>> reader.categories()
['neg', 'pos']
A third way of specifying categories is to use the cat_file keyword argument to specify a filename containing a mapping of fileid to category. For example, the brown corpus has a file called cats.txt that looks like the following:
ca44 news
cb01 editorial
The reuters corpus has files in multiple categories, and its cats.txt looks like the following:
test/14840 rubber coffee lumber palm-oil veg-oil
test/14841 wheat grain
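A category file in this format is straightforward to read by hand. The following sketch assumes each line holds one fileid followed by one or more space-separated category labels; the parse_cat_file helper is hypothetical and only illustrates the format, not how NLTK reads the file internally.

```python
def parse_cat_file(text, delimiter=' '):
    """Hypothetical helper: map each fileid to its list of category labels."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The first field is the fileid; everything after it is a category.
        fileid, *categories = line.split(delimiter)
        mapping[fileid] = categories
    return mapping


brown_cats = "ca44 news\ncb01 editorial\n"
reuters_cats = ("test/14840 rubber coffee lumber palm-oil veg-oil\n"
                "test/14841 wheat grain\n")

print(parse_cat_file(brown_cats))
print(parse_cat_file(reuters_cats)['test/14841'])
```

The resulting dictionary has the same shape as a cat_map: each fileid maps to a list of category labels, which is how a single file can belong to several categories.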
The brown corpus reader is actually an instance of CategorizedTaggedCorpusReader, which inherits from CategorizedCorpusReader and TaggedCorpusReader. Just like in CategorizedPlaintextCorpusReader, it overrides all the methods of TaggedCorpusReader to allow a categories argument, so you can call brown.tagged_sents(categories=['news']) to get all the tagged sentences from the news category. You can use the CategorizedTaggedCorpusReader class just like CategorizedPlaintextCorpusReader for your own categorized and tagged text corpora.
The movie_reviews corpus reader is an instance of CategorizedPlaintextCorpusReader, as is the reuters corpus reader. But where the movie_reviews corpus only has two categories (neg and pos), reuters has 90 categories. These corpora are often used for training and evaluating classifiers, which will be covered in Chapter 7, Text Classification.
In the next chapter, we'll create a subclass of CategorizedCorpusReader and ChunkedCorpusReader for reading a categorized chunk corpus. Also, see Chapter 7, Text Classification, in which we use categorized text for classification.