Creating a categorized text corpus

If you have a large corpus of text, you might want to categorize it into separate sections. This can be helpful for organization, or for text classification, which is covered in Chapter 7, Text Classification. The brown corpus, for example, has a number of different categories, as shown in the following code:

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

In this recipe, we'll learn how to create our own categorized text corpus.

Getting ready

The easiest way to categorize a corpus is to have one file for each category. The following are two excerpts from the movie_reviews corpus:

  • movie_pos.txt:
    the thin red line is flawed but it provokes .
  • movie_neg.txt:
    a big-budget and glossy production can not make up for a lack of spontaneity that permeates their tv show .

With these two files, we'll have two categories: pos and neg.
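To follow along, the two files can be created with a few lines of Python. This is only a minimal sketch: the temporary directory is an assumption made for illustration, and the recipe itself only needs both files to sit in the directory that will be used as the corpus root.

```python
import tempfile
from pathlib import Path

# The two excerpts quoted above, keyed by file name.
excerpts = {
    "movie_pos.txt": "the thin red line is flawed but it provokes .",
    "movie_neg.txt": (
        "a big-budget and glossy production can not make up for "
        "a lack of spontaneity that permeates their tv show ."
    ),
}

# Assumption: a throwaway directory stands in for the corpus root.
root = Path(tempfile.mkdtemp())
for name, text in excerpts.items():
    (root / name).write_text(text + "\n", encoding="utf-8")

# List what was written, sorted so the order is predictable.
created = sorted(p.name for p in root.iterdir())
print(created)
```

The recipe below uses '.' as the root directory, so either run this from the directory you want to use or pass str(root) as the root instead.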

How to do it...

We'll use the CategorizedPlaintextCorpusReader class, which inherits from both PlaintextCorpusReader and CategorizedCorpusReader. Between them, these two superclasses require three arguments: the root directory, the fileids, and a category specification:

>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> reader = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')
>>> reader.categories()
['neg', 'pos']
>>> reader.fileids(categories=['neg'])
['movie_neg.txt']
>>> reader.fileids(categories=['pos'])
['movie_pos.txt']

How it works...

The first two arguments to CategorizedPlaintextCorpusReader are the root directory and the fileids, which are passed on to the PlaintextCorpusReader class to read in the files. The cat_pattern keyword argument is a regular expression for extracting the category name from each fileid. In our case, the category is the part of the fileid after movie_ and before .txt. The category must be surrounded by grouping parentheses.
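The extraction step can be tried on its own with the standard re module. This is a sketch of the idea rather than the reader's actual code: it assumes the category is taken from the first capturing group of a match anchored at the start of the fileid, and it writes the pattern with its backslash escapes (\w and \.) spelled out.

```python
import re

# Category pattern: the capturing group picks out the text between
# "movie_" and ".txt".
cat_pattern = r'movie_(\w+)\.txt'
fileids = ['movie_neg.txt', 'movie_pos.txt']

categories = {}
for fileid in fileids:
    match = re.match(cat_pattern, fileid)
    if match:
        # group(1) is the part inside the grouping parentheses.
        categories[fileid] = match.group(1)

print(categories)
```

Without the grouping parentheses there would be no group 1 to read, which is why the pattern must contain them.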

The cat_pattern keyword is passed to CategorizedCorpusReader, which overrides the common corpus reader functions such as fileids(), words(), sents(), and paras() to accept a categories keyword argument. This way, you could get all the pos sentences by calling reader.sents(categories=['pos']). The CategorizedCorpusReader class also provides the categories() function, which returns a list of all the known categories in the corpus.

The CategorizedPlaintextCorpusReader class is an example of using multiple inheritance to join methods from multiple superclasses, as shown in the following diagram:

[Diagram: CategorizedPlaintextCorpusReader inheriting from both PlaintextCorpusReader and CategorizedCorpusReader]
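The same design can be illustrated with a toy example. Every class and method name below (PlainReader, CategoryMixin, CategorizedReader, fileids_for) is hypothetical and invented for this sketch; it is not NLTK's source code, only a small model of how one class can combine file reading from one parent with category lookup from another.

```python
class PlainReader:
    """Hypothetical stand-in for a plaintext reader."""

    def __init__(self, texts):
        self._texts = texts  # maps fileid -> raw text

    def fileids(self):
        return sorted(self._texts)

    def words(self, fileids=None):
        if fileids is None:
            fileids = self.fileids()
        # Whitespace splitting is a simplification of real tokenization.
        return [w for f in fileids for w in self._texts[f].split()]


class CategoryMixin:
    """Hypothetical stand-in for the category bookkeeping."""

    def __init__(self, cat_map):
        self._cat_map = cat_map  # maps fileid -> list of categories

    def categories(self):
        return sorted({c for cats in self._cat_map.values() for c in cats})

    def fileids_for(self, categories):
        wanted = set(categories)
        return sorted(f for f, cats in self._cat_map.items() if wanted & set(cats))


class CategorizedReader(CategoryMixin, PlainReader):
    """Joins both parents and adds a categories keyword to words()."""

    def __init__(self, texts, cat_map):
        PlainReader.__init__(self, texts)
        CategoryMixin.__init__(self, cat_map)

    def words(self, fileids=None, categories=None):
        # Resolve categories to fileids, then defer to the plain reader.
        if categories is not None:
            fileids = self.fileids_for(categories)
        return PlainReader.words(self, fileids)


toy = CategorizedReader(
    {'movie_pos.txt': 'flawed but it provokes .',
     'movie_neg.txt': 'a lack of spontaneity .'},
    {'movie_pos.txt': ['pos'], 'movie_neg.txt': ['neg']},
)
toy_categories = toy.categories()
pos_words = toy.words(categories=['pos'])
print(toy_categories)
print(pos_words)
```

The subclass contributes very little code of its own: it translates a categories argument into fileids and lets the inherited method do the reading.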

There's more...

Instead of cat_pattern, you could pass in a cat_map, which is a dictionary mapping each fileid to a list of category labels, as shown in the following code:

>>> reader = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt', cat_map={'movie_pos.txt': ['pos'], 'movie_neg.txt': ['neg']})
>>> reader.categories()
['neg', 'pos']

Category file

A third way of specifying categories is to use the cat_file keyword argument to specify a filename containing a mapping of fileid to category. For example, the brown corpus has a file called cats.txt that looks like the following:

ca44 news
cb01 editorial

The reuters corpus has files in multiple categories, and its cats.txt looks like the following:

test/14840 rubber coffee lumber palm-oil veg-oil
test/14841 wheat grain
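The format is simple enough to parse by hand, which helps show what the reader gets out of such a file. The following is a minimal sketch, assuming one fileid per line followed by space-separated categories; the function name parse_cat_file and its delimiter parameter are made up for this example and are not part of NLTK.

```python
# Two sample lines in the style of the reuters cats.txt shown above.
cats_txt = """\
test/14840 rubber coffee lumber palm-oil veg-oil
test/14841 wheat grain
"""


def parse_cat_file(text, delimiter=' '):
    """Return a dict mapping each fileid to its list of categories."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # First field is the fileid; everything after it is a category.
        fileid, *categories = line.split(delimiter)
        mapping[fileid] = categories
    return mapping


cat_map = parse_cat_file(cats_txt)
print(cat_map['test/14841'])
```

The resulting dictionary has the same shape as the cat_map argument described earlier, so the two ways of specifying categories carry the same information.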

Categorized tagged corpus reader

The brown corpus reader is actually an instance of CategorizedTaggedCorpusReader, which inherits from CategorizedCorpusReader and TaggedCorpusReader. Just like in CategorizedPlaintextCorpusReader, it overrides all the methods of TaggedCorpusReader to allow a categories argument, so you can call brown.tagged_sents(categories=['news']) to get all the tagged sentences from the news category. You can use the CategorizedTaggedCorpusReader class just like CategorizedPlaintextCorpusReader for your own categorized and tagged text corpora.

Categorized corpora

The movie_reviews corpus reader is an instance of CategorizedPlaintextCorpusReader, as is the reuters corpus reader. But where the movie_reviews corpus only has two categories (neg and pos), reuters has 90 categories. These corpora are often used for training and evaluating classifiers, which will be covered in Chapter 7, Text Classification.

See also

In the next chapter, we'll create a subclass of CategorizedCorpusReader and ChunkedCorpusReader for reading a categorized chunk corpus. Also, see Chapter 7, Text Classification in which we use categorized text for classification.
