NLTK provides a CategorizedPlaintextCorpusReader and a CategorizedTaggedCorpusReader class, but there's no categorized corpus reader for chunked corpora. So in this recipe, we're going to make one.

Refer to the earlier recipe, Creating a chunked phrase corpus, for an explanation of ChunkedCorpusReader, and refer to the previous recipe for details on CategorizedPlaintextCorpusReader and CategorizedTaggedCorpusReader, both of which inherit from CategorizedCorpusReader.
We'll create a class called CategorizedChunkedCorpusReader that inherits from both CategorizedCorpusReader and ChunkedCorpusReader. It is heavily based on the CategorizedTaggedCorpusReader class, and also provides three additional methods for getting categorized chunks. The following code is found in catchunked.py:
from nltk.corpus.reader import CategorizedCorpusReader, ChunkedCorpusReader

class CategorizedChunkedCorpusReader(CategorizedCorpusReader, ChunkedCorpusReader):
    def __init__(self, *args, **kwargs):
        CategorizedCorpusReader.__init__(self, kwargs)
        ChunkedCorpusReader.__init__(self, *args, **kwargs)

    def _resolve(self, fileids, categories):
        if fileids is not None and categories is not None:
            raise ValueError('Specify fileids or categories, not both')
        if categories is not None:
            return self.fileids(categories)
        else:
            return fileids
All of the following methods call the corresponding method in ChunkedCorpusReader with the value returned from _resolve(). We'll start with the plain text methods:
    def raw(self, fileids=None, categories=None):
        return ChunkedCorpusReader.raw(self, self._resolve(fileids, categories))

    def words(self, fileids=None, categories=None):
        return ChunkedCorpusReader.words(self, self._resolve(fileids, categories))

    def sents(self, fileids=None, categories=None):
        return ChunkedCorpusReader.sents(self, self._resolve(fileids, categories))

    def paras(self, fileids=None, categories=None):
        return ChunkedCorpusReader.paras(self, self._resolve(fileids, categories))
Next is the code for the tagged text methods:
    def tagged_words(self, fileids=None, categories=None):
        return ChunkedCorpusReader.tagged_words(self, self._resolve(fileids, categories))

    def tagged_sents(self, fileids=None, categories=None):
        return ChunkedCorpusReader.tagged_sents(self, self._resolve(fileids, categories))

    def tagged_paras(self, fileids=None, categories=None):
        return ChunkedCorpusReader.tagged_paras(self, self._resolve(fileids, categories))
And finally, we have code for the chunked methods, which is what we've really been after:
    def chunked_words(self, fileids=None, categories=None):
        return ChunkedCorpusReader.chunked_words(self, self._resolve(fileids, categories))

    def chunked_sents(self, fileids=None, categories=None):
        return ChunkedCorpusReader.chunked_sents(self, self._resolve(fileids, categories))

    def chunked_paras(self, fileids=None, categories=None):
        return ChunkedCorpusReader.chunked_paras(self, self._resolve(fileids, categories))
All these methods together give us a complete CategorizedChunkedCorpusReader class.
The CategorizedChunkedCorpusReader class overrides all the ChunkedCorpusReader methods to take a categories argument for locating fileids. These fileids are found with the internal _resolve() method, which makes use of CategorizedCorpusReader.fileids() to return the fileids for a given list of categories. If no categories are given, _resolve() just returns the given fileids, which could be None, in which case all the files are read. The initialization of both CategorizedCorpusReader and ChunkedCorpusReader is what makes all this possible. If you look at the code for CategorizedTaggedCorpusReader, you'll see that it's very similar.
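The resolution logic can also be sketched in isolation. The following is a minimal stand-in, where categories_to_fileids is a hypothetical mapping playing the role of CategorizedCorpusReader.fileids():

```python
# Standalone sketch of the fileids-or-categories resolution pattern used
# by _resolve(). categories_to_fileids is a hypothetical stand-in for the
# category-to-fileids mapping that CategorizedCorpusReader maintains.
def resolve(fileids, categories, categories_to_fileids):
    if fileids is not None and categories is not None:
        raise ValueError('Specify fileids or categories, not both')
    if categories is not None:
        # Collect every fileid belonging to the requested categories
        return [f for c in categories for f in categories_to_fileids[c]]
    return fileids  # may be None, meaning "read all files"

mapping = {'0001': ['wsj_0001.pos'], '0002': ['wsj_0002.pos']}
print(resolve(None, ['0001'], mapping))          # ['wsj_0001.pos']
print(resolve(['wsj_0002.pos'], None, mapping))  # ['wsj_0002.pos']
print(resolve(None, None, mapping))              # None
```

Passing both arguments at once raises a ValueError, exactly as in _resolve().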
The inheritance diagram looks like this:
The following is example code for using the treebank corpus. All we're doing is making categories out of the fileids, but the point is that you could use the same techniques to create your own categorized chunk corpus:
>>> import nltk.data
>>> from catchunked import CategorizedChunkedCorpusReader
>>> path = nltk.data.find('corpora/treebank/tagged')
>>> reader = CategorizedChunkedCorpusReader(path, r'wsj_.*.pos', cat_pattern=r'wsj_(.*).pos')
>>> len(reader.categories()) == len(reader.fileids())
True
>>> len(reader.chunked_sents(categories=['0001']))
16
We use nltk.data.find() to search the data directories and get a FileSystemPathPointer to the treebank corpus. All the treebank tagged files start with wsj_, followed by a number, and end with .pos. The previous code turns that file number into a category.
As covered in the Creating a chunked phrase corpus recipe, there's an alternative format and reader for a chunk corpus using IOB tags. To have a categorized corpus of IOB chunks, we have to make a new corpus reader.
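As a refresher, the IOB format puts one token per line with its part-of-speech tag and an inside-outside-begin chunk tag. The sentence below is only an illustrative sample, and the grouping function is a rough sketch of the idea, not NLTK's actual parsing code:

```python
# Illustrative sample of IOB-formatted chunk data: word, POS tag, IOB tag.
iob_sample = """\
Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
"""

# Rough sketch of grouping IOB-tagged tokens into chunks, keeping only
# the requested chunk types (a simplification of what the reader does).
def iob_chunks(text, chunk_types=('NP', 'VP')):
    chunks, current, current_type = [], [], None
    for line in text.splitlines():
        word, pos, iob = line.split()
        tag, _, ctype = iob.partition('-')
        if tag == 'B' and ctype in chunk_types:
            if current:
                chunks.append((current_type, current))
            current, current_type = [(word, pos)], ctype
        elif tag == 'I' and ctype == current_type:
            current.append((word, pos))
        else:
            if current:
                chunks.append((current_type, current))
            current, current_type = [], None
    if current:
        chunks.append((current_type, current))
    return chunks

for ctype, tokens in iob_chunks(iob_sample):
    print(ctype, [word for word, pos in tokens])
# NP ['Mr.', 'Meador']
# VP ['had', 'been']
# NP ['executive', 'vice', 'president']
```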
The following is the code for CategorizedConllChunkCorpusReader, a subclass of CategorizedCorpusReader and ConllChunkCorpusReader. It overrides all the methods of ConllCorpusReader that take a fileids argument, so the methods can also take a categories argument. The ConllChunkCorpusReader is just a small subclass of ConllCorpusReader that handles initialization; most of the work is done in ConllCorpusReader. This code can also be found in catchunked.py.
from nltk.corpus.reader import CategorizedCorpusReader, ConllCorpusReader, ConllChunkCorpusReader

class CategorizedConllChunkCorpusReader(CategorizedCorpusReader, ConllChunkCorpusReader):
    def __init__(self, *args, **kwargs):
        CategorizedCorpusReader.__init__(self, kwargs)
        ConllChunkCorpusReader.__init__(self, *args, **kwargs)

    def _resolve(self, fileids, categories):
        if fileids is not None and categories is not None:
            raise ValueError('Specify fileids or categories, not both')
        if categories is not None:
            return self.fileids(categories)
        else:
            return fileids
All the following methods call the corresponding method of ConllCorpusReader with the value returned from _resolve(). We'll start with the plain text methods:
    def raw(self, fileids=None, categories=None):
        return ConllCorpusReader.raw(self, self._resolve(fileids, categories))

    def words(self, fileids=None, categories=None):
        return ConllCorpusReader.words(self, self._resolve(fileids, categories))

    def sents(self, fileids=None, categories=None):
        return ConllCorpusReader.sents(self, self._resolve(fileids, categories))
The ConllCorpusReader class does not recognize paragraphs, so there are no *_paras() methods. Next is the code for the tagged and chunked methods:
    def tagged_words(self, fileids=None, categories=None):
        return ConllCorpusReader.tagged_words(self, self._resolve(fileids, categories))

    def tagged_sents(self, fileids=None, categories=None):
        return ConllCorpusReader.tagged_sents(self, self._resolve(fileids, categories))

    def chunked_words(self, fileids=None, categories=None, chunk_types=None):
        return ConllCorpusReader.chunked_words(self, self._resolve(fileids, categories), chunk_types)

    def chunked_sents(self, fileids=None, categories=None, chunk_types=None):
        return ConllCorpusReader.chunked_sents(self, self._resolve(fileids, categories), chunk_types)
For completeness, we must also override the following methods of the ConllCorpusReader class:
    def parsed_sents(self, fileids=None, categories=None, pos_in_tree=None):
        return ConllCorpusReader.parsed_sents(self, self._resolve(fileids, categories), pos_in_tree)

    def srl_spans(self, fileids=None, categories=None):
        return ConllCorpusReader.srl_spans(self, self._resolve(fileids, categories))

    def srl_instances(self, fileids=None, categories=None, pos_in_tree=None, flatten=True):
        return ConllCorpusReader.srl_instances(self, self._resolve(fileids, categories), pos_in_tree, flatten)

    def iob_words(self, fileids=None, categories=None):
        return ConllCorpusReader.iob_words(self, self._resolve(fileids, categories))

    def iob_sents(self, fileids=None, categories=None):
        return ConllCorpusReader.iob_sents(self, self._resolve(fileids, categories))
The inheritance diagram for this class is as follows:
Here is example code using the conll2000 corpus:
>>> import nltk.data
>>> from catchunked import CategorizedConllChunkCorpusReader
>>> path = nltk.data.find('corpora/conll2000')
>>> reader = CategorizedConllChunkCorpusReader(path, r'.*.txt', ('NP','VP','PP'), cat_pattern=r'(.*).txt')
>>> reader.categories()
['test', 'train']
>>> reader.fileids()
['test.txt', 'train.txt']
>>> len(reader.chunked_sents(categories=['test']))
2012
As with treebank, we're using the fileids for categories. The ConllChunkCorpusReader class requires a third argument to specify the chunk types, which are used to parse the IOB tags. As you learned in the Creating a chunked phrase corpus recipe, the conll2000 corpus recognizes three chunk types: NP, VP, and PP.
In the Creating a chunked phrase corpus recipe of this chapter, we covered both the ChunkedCorpusReader and ConllChunkCorpusReader classes. And in the previous recipe, we covered CategorizedPlaintextCorpusReader and CategorizedTaggedCorpusReader, which share the same superclass used by CategorizedChunkedCorpusReader and CategorizedConllChunkCorpusReader, that is, CategorizedCorpusReader.