Creating a categorized chunk corpus reader

NLTK provides the CategorizedPlaintextCorpusReader and CategorizedTaggedCorpusReader classes, but there's no categorized corpus reader for chunked corpora. So in this recipe, we're going to make one.

Getting ready

Refer to the earlier recipe, Creating a chunked phrase corpus, for an explanation of ChunkedCorpusReader, and refer to the previous recipe for details on CategorizedPlaintextCorpusReader and CategorizedTaggedCorpusReader, both of which inherit from CategorizedCorpusReader.

How to do it...

We'll create a class called CategorizedChunkedCorpusReader that inherits from both CategorizedCorpusReader and ChunkedCorpusReader. It is heavily based on the CategorizedTaggedCorpusReader class, and also provides three additional methods for getting categorized chunks. The following code is found in catchunked.py:

from nltk.corpus.reader import CategorizedCorpusReader, ChunkedCorpusReader

class CategorizedChunkedCorpusReader(CategorizedCorpusReader, ChunkedCorpusReader):
  def __init__(self, *args, **kwargs):
    CategorizedCorpusReader.__init__(self, kwargs)
    ChunkedCorpusReader.__init__(self, *args, **kwargs)

  def _resolve(self, fileids, categories):
    if fileids is not None and categories is not None:
      raise ValueError('Specify fileids or categories, not both')
    if categories is not None:
      return self.fileids(categories)
    else:
      return fileids

All of the following methods call the corresponding function in ChunkedCorpusReader with the value returned from _resolve(). We'll start with the plain text methods:

  def raw(self, fileids=None, categories=None):
    return ChunkedCorpusReader.raw(self, self._resolve(fileids, categories))

  def words(self, fileids=None, categories=None):
    return ChunkedCorpusReader.words(self, self._resolve(fileids, categories))

  def sents(self, fileids=None, categories=None):
    return ChunkedCorpusReader.sents(self, self._resolve(fileids, categories))

  def paras(self, fileids=None, categories=None):
    return ChunkedCorpusReader.paras(self, self._resolve(fileids, categories))

Next is the code for the tagged text methods:

  def tagged_words(self, fileids=None, categories=None):
    return ChunkedCorpusReader.tagged_words(self, self._resolve(fileids, categories))

  def tagged_sents(self, fileids=None, categories=None):
    return ChunkedCorpusReader.tagged_sents(self, self._resolve(fileids, categories))

  def tagged_paras(self, fileids=None, categories=None):
    return ChunkedCorpusReader.tagged_paras(self, self._resolve(fileids, categories))

And finally, we have code for the chunked methods, which is what we've really been after:

  def chunked_words(self, fileids=None, categories=None):
    return ChunkedCorpusReader.chunked_words(self, self._resolve(fileids, categories))

  def chunked_sents(self, fileids=None, categories=None):
    return ChunkedCorpusReader.chunked_sents(self, self._resolve(fileids, categories))

  def chunked_paras(self, fileids=None, categories=None):
    return ChunkedCorpusReader.chunked_paras(self, self._resolve(fileids, categories))

All these methods together give us a complete CategorizedChunkedCorpusReader class.

How it works...

The CategorizedChunkedCorpusReader class overrides all the ChunkedCorpusReader methods to take a categories argument for locating fileids. These fileids are found with the internal _resolve() function. This _resolve() function makes use of CategorizedCorpusReader.fileids() to return fileids for a given list of categories. If no categories are given, _resolve() just returns the given fileids, which could be None, in which case all the files are read. The initialization of both CategorizedCorpusReader and ChunkedCorpusReader is what makes all this possible. If you look at the code for CategorizedTaggedCorpusReader, you'll see that it's very similar.
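To see the resolution in action, here is a minimal sketch, assuming the reader built in the treebank example below, where the category '0001' maps to the file ID 'wsj_0001.pos'. It shows that passing categories is equivalent to passing the resolved fileids, and that passing both raises the ValueError from _resolve():

  # A minimal sketch, assuming 'reader' is the treebank-based
  # CategorizedChunkedCorpusReader created in the example below.
  by_category = reader.chunked_sents(categories=['0001'])
  by_fileid = reader.chunked_sents(fileids=['wsj_0001.pos'])
  assert len(by_category) == len(by_fileid)

  # _resolve() refuses to accept both arguments at once.
  try:
    reader.chunked_sents(fileids=['wsj_0001.pos'], categories=['0001'])
  except ValueError as err:
    print(err)  # Specify fileids or categories, not both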

The inheritance diagram looks like this:

[Inheritance diagram: CategorizedChunkedCorpusReader derives from both CategorizedCorpusReader and ChunkedCorpusReader]

The following is example code for using the treebank corpus. All we're doing is making categories out of the fileids, but the point is that you could use the same techniques to create your own categorized chunk corpus:

>>> import nltk.data
>>> from catchunked import CategorizedChunkedCorpusReader
>>> path = nltk.data.find('corpora/treebank/tagged')
>>> reader = CategorizedChunkedCorpusReader(path, r'wsj_.*\.pos', cat_pattern=r'wsj_(.*)\.pos')
>>> len(reader.categories()) == len(reader.fileids())
True
>>> len(reader.chunked_sents(categories=['0001']))
16

We use nltk.data.find() to search the data directories and get a FileSystemPathPointer pointing to the treebank corpus. All the treebank tagged files start with wsj_, followed by a number, and end with .pos. The previous code turns that file number into a category.
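For illustration, here is a minimal sketch of the idea behind cat_pattern (this is not CategorizedCorpusReader's actual implementation, just the mapping it performs): the first capturing group of the pattern becomes the category name.

  import re

  # A minimal sketch of how cat_pattern maps a file ID to a category:
  # the first capturing group of the pattern is the category name.
  cat_pattern = r'wsj_(.*)\.pos'
  fileid = 'wsj_0001.pos'
  print(re.match(cat_pattern, fileid).group(1))  # prints '0001'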

There's more...

As covered in the Creating a chunked phrase corpus recipe, there's an alternative format and reader for a chunk corpus using IOB tags. To have a categorized corpus of IOB chunks, we have to make a new corpus reader.

Categorized CoNLL chunk corpus reader

The following is the code for the subclass of CategorizedCorpusReader and ConllChunkCorpusReader called CategorizedConllChunkCorpusReader. It overrides all methods of ConllCorpusReader that take a fileids argument, so the methods can also take a categories argument. The ConllChunkCorpusReader is just a small subclass of ConllCorpusReader that handles initialization; most of the work is done in ConllCorpusReader. This code can also be found in catchunked.py.

from nltk.corpus.reader import CategorizedCorpusReader, ConllCorpusReader, ConllChunkCorpusReader

class CategorizedConllChunkCorpusReader(CategorizedCorpusReader, ConllChunkCorpusReader):
  def __init__(self, *args, **kwargs):
    CategorizedCorpusReader.__init__(self, kwargs)
    ConllChunkCorpusReader.__init__(self, *args, **kwargs)

  def _resolve(self, fileids, categories):
    if fileids is not None and categories is not None:
      raise ValueError('Specify fileids or categories, not both')
    if categories is not None:
      return self.fileids(categories)
    else:
      return fileids

All the following methods call the corresponding method of ConllCorpusReader with the value returned from _resolve(). We'll start with the plain text methods:

  def raw(self, fileids=None, categories=None):
    return ConllCorpusReader.raw(self, self._resolve(fileids, categories))

  def words(self, fileids=None, categories=None):
    return ConllCorpusReader.words(self, self._resolve(fileids, categories))

  def sents(self, fileids=None, categories=None):
    return ConllCorpusReader.sents(self, self._resolve(fileids, categories))

The ConllCorpusReader class does not recognize paragraphs, so there are no *_paras() methods. Next is the code for the tagged and chunked methods:

  def tagged_words(self, fileids=None, categories=None):
    return ConllCorpusReader.tagged_words(self, self._resolve(fileids, categories))

  def tagged_sents(self, fileids=None, categories=None):
    return ConllCorpusReader.tagged_sents(self, self._resolve(fileids, categories))

  def chunked_words(self, fileids=None, categories=None, chunk_types=None):
    return ConllCorpusReader.chunked_words(self, self._resolve(fileids, categories), chunk_types)

  def chunked_sents(self, fileids=None, categories=None, chunk_types=None):
    return ConllCorpusReader.chunked_sents(self, self._resolve(fileids, categories), chunk_types)

For completeness, we must override the following methods of the ConllCorpusReader class:

  def parsed_sents(self, fileids=None, categories=None, pos_in_tree=None):
    return ConllCorpusReader.parsed_sents(
      self, self._resolve(fileids, categories), pos_in_tree)

  def srl_spans(self, fileids=None, categories=None):
    return ConllCorpusReader.srl_spans(self, self._resolve(fileids, categories))

  def srl_instances(self, fileids=None, categories=None, pos_in_tree=None, flatten=True):
    return ConllCorpusReader.srl_instances(self, self._resolve(fileids, categories), pos_in_tree, flatten)

  def iob_words(self, fileids=None, categories=None):
    return ConllCorpusReader.iob_words(self, self._resolve(fileids, categories))

  def iob_sents(self, fileids=None, categories=None):
    return ConllCorpusReader.iob_sents(self, self._resolve(fileids, categories))

The inheritance diagram for this class is as follows:

[Inheritance diagram: CategorizedConllChunkCorpusReader derives from both CategorizedCorpusReader and ConllChunkCorpusReader, which in turn derives from ConllCorpusReader]

Here is example code using the conll2000 corpus:

>>> import nltk.data
>>> from catchunked import CategorizedConllChunkCorpusReader
>>> path = nltk.data.find('corpora/conll2000')
>>> reader = CategorizedConllChunkCorpusReader(path, r'.*\.txt', ('NP','VP','PP'), cat_pattern=r'(.*)\.txt')
>>> reader.categories()
['test', 'train']
>>> reader.fileids()
['test.txt', 'train.txt']
>>> len(reader.chunked_sents(categories=['test']))
2012

As with treebank, we're using the fileids for categories. The ConllChunkCorpusReader class requires a third argument to specify the chunk types. These chunk types are used to parse the IOB tags. As you learned in the Creating a chunked phrase corpus recipe, the conll2000 corpus recognizes the following three chunk types (a sketch after this list shows how to restrict parsing to a subset of them):

  • NP for noun phrases
  • VP for verb phrases
  • PP for prepositional phrases
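Because the chunked_sents() override passes chunk_types straight through to ConllCorpusReader, you can restrict the parse to a subset of these types. A minimal sketch, reusing the reader from the example above; words in the other chunk types are still returned, just not grouped into subtrees:

  # A minimal sketch: parse only NP chunks from the 'test' category.
  np_only = reader.chunked_sents(categories=['test'], chunk_types=['NP'])
  print(np_only[0])  # a Tree whose only subtrees are NP chunks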

See also

In the Creating a chunked phrase corpus recipe of this chapter, we covered both the ChunkedCorpusReader and ConllChunkCorpusReader classes. And in the previous recipe, we covered CategorizedPlaintextCorpusReader and CategorizedTaggedCorpusReader, which share the same superclass used by CategorizedChunkedCorpusReader and CategorizedConllChunkCorpusReader, that is, CategorizedCorpusReader.
