A corpus view is a class that wraps a corpus file and reads in blocks of tokens as needed. Its purpose is to provide a view into a file without reading the whole file at once (since corpus files can often be quite large). If the corpus readers included with NLTK already meet all your needs, then you do not have to know anything about corpus views. But if you have a custom file format that needs special handling, this recipe will show you how to create and use a custom corpus view. The main corpus view class is StreamBackedCorpusView, which opens a single file as a stream and maintains an internal cache of the blocks it has read.
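To make the idea of lazy, block-wise reading concrete, here is a toy sketch that uses only the standard library. It is not NLTK's implementation: the names LazyBlockView and read_one_line are hypothetical, and the "cache" is reduced to a list of block start offsets. The view pulls one block at a time from a stream through a function that returns the next list of tokens.

```python
import io


class LazyBlockView:
    """Toy stand-in for a corpus view (hypothetical, not NLTK's code)."""

    def __init__(self, stream, block_reader):
        self._stream = stream
        self._block_reader = block_reader
        # Start offset of every block discovered so far.
        self._filepos = [stream.tell()]

    def __iter__(self):
        block_index = 0
        while True:
            # Jump straight to a remembered block start.
            self._stream.seek(self._filepos[block_index])
            tokens = self._block_reader(self._stream)
            if not tokens:
                # In this toy, an empty block means end of stream.
                return
            # Remember where the next block begins the first time we see it.
            if block_index + 1 == len(self._filepos):
                self._filepos.append(self._stream.tell())
            yield from tokens
            block_index += 1


def read_one_line(stream):
    """Hypothetical block reader: one line per block, split on whitespace."""
    return stream.readline().split()


view = LazyBlockView(io.StringIO("the quick fox\njumps over\n"), read_one_line)
tokens = list(view)
```

Only the block currently being read is held in memory, and a second pass over the view can seek directly to the recorded offsets.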
Blocks of tokens are read in with a block reader function. A block can be any piece of text, such as a paragraph or a line, and tokens are parts of a block, such as individual words. In the Creating a part-of-speech tagged word corpus recipe, we discussed the default para_block_reader function of the TaggedCorpusReader class, which reads lines from a file until it finds a blank line, then returns those lines as a single paragraph token. The actual block reader function is nltk.corpus.reader.util.read_blankline_block. The TaggedCorpusReader class passes this block reader function into a TaggedCorpusView class whenever it needs to read blocks from a file. The TaggedCorpusView class is a subclass of StreamBackedCorpusView that knows to split paragraphs of word/tag into (word, tag) tuples.
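The behavior described above, reading lines until a blank line and returning them as one paragraph token, can be sketched in a few lines of plain Python. This is an illustrative reimplementation under a hypothetical name (read_paragraph_block), not the source of nltk.corpus.reader.util.read_blankline_block.

```python
import io


def read_paragraph_block(stream):
    """Illustrative blank-line block reader (hypothetical, not NLTK's code).

    Returns a one-item list holding the next paragraph, or [] at end of stream.
    """
    lines = []
    while True:
        line = stream.readline()
        if not line:
            # End of stream: return whatever has been collected.
            break
        if line.strip():
            # Non-blank line: it belongs to the current paragraph.
            lines.append(line)
        elif lines:
            # Blank line after some text: the paragraph is complete.
            break
        # Blank lines before any text are simply skipped.
    return ["".join(lines)] if lines else []


stream = io.StringIO("First paragraph,\nsecond line.\n\nSecond paragraph.\n")
first = read_paragraph_block(stream)
second = read_paragraph_block(stream)
done = read_paragraph_block(stream)
```

Each call consumes exactly one paragraph and leaves the stream positioned at the start of the next one, which is what lets a view read block by block.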
We'll start with the simple case of a plain text file with a heading that should be ignored by the corpus reader. Let's make a file called heading_text.txt that looks like this:
    A simple heading

    Here is the actual text for the corpus.

    Paragraphs are split by blanklines.

    This is the 3rd paragraph.
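If you would rather create the file from a script, a short sketch follows. The layout is an assumption: one blank line between the heading and each paragraph, which is consistent with the four paragraphs the default reader is later shown to find.

```python
from pathlib import Path

# Paragraphs are separated by blank lines; the first one is the heading.
heading_text = (
    "A simple heading\n"
    "\n"
    "Here is the actual text for the corpus.\n"
    "\n"
    "Paragraphs are split by blanklines.\n"
    "\n"
    "This is the 3rd paragraph.\n"
)

# Write the file into the current directory, where the later examples look for it.
Path("heading_text.txt").write_text(heading_text, encoding="utf-8")
```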
Normally, we'd use the PlaintextCorpusReader class, but by default it will treat A simple heading as the first paragraph. To ignore this heading, we need to subclass the PlaintextCorpusReader class so we can override its CorpusView class variable with our own StreamBackedCorpusView subclass. The following is the code found in corpus.py:
    from nltk.corpus.reader import PlaintextCorpusReader
    from nltk.corpus.reader.util import StreamBackedCorpusView

    class IgnoreHeadingCorpusView(StreamBackedCorpusView):
        def __init__(self, *args, **kwargs):
            StreamBackedCorpusView.__init__(self, *args, **kwargs)
            # open self._stream
            self._open()
            # skip the heading block
            self.read_block(self._stream)
            # reset the start position to the current position in the stream
            self._filepos = [self._stream.tell()]

    class IgnoreHeadingCorpusReader(PlaintextCorpusReader):
        CorpusView = IgnoreHeadingCorpusView
To demonstrate that this works as expected, here is code showing that the default PlaintextCorpusReader class finds four paragraphs, while our IgnoreHeadingCorpusReader class only has three paragraphs:
    >>> from nltk.corpus.reader import PlaintextCorpusReader
    >>> plain = PlaintextCorpusReader('.', ['heading_text.txt'])
    >>> len(plain.paras())
    4
    >>> from corpus import IgnoreHeadingCorpusReader
    >>> reader = IgnoreHeadingCorpusReader('.', ['heading_text.txt'])
    >>> len(reader.paras())
    3
The PlaintextCorpusReader class by design has a CorpusView class variable that can be overridden by subclasses. So we do just that, and make our IgnoreHeadingCorpusView class the CorpusView class variable.
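The class-variable hook can be illustrated without NLTK. In the sketch below every name (ParagraphView, SkipHeadingView, Reader, SkipHeadingReader) is hypothetical and the classes are deliberately tiny. The point is only that the reader instantiates whatever class its View attribute names, so a subclass can swap in a view that discards the first block.

```python
import io


class ParagraphView:
    """Toy view that yields blank-line-separated paragraphs from a stream."""

    def __init__(self, stream):
        self._stream = stream

    def read_block(self):
        lines = []
        # iter(readline, "") stops cleanly at end of stream.
        for line in iter(self._stream.readline, ""):
            if line.strip():
                lines.append(line.strip())
            elif lines:
                break  # blank line closes the current paragraph
        return [" ".join(lines)] if lines else []

    def paras(self):
        result = []
        while True:
            block = self.read_block()
            if not block:
                return result
            result.extend(block)


class SkipHeadingView(ParagraphView):
    def __init__(self, stream):
        super().__init__(stream)
        self.read_block()  # consume and discard the heading paragraph


class Reader:
    View = ParagraphView  # class variable acting as the hook

    def __init__(self, text):
        self._text = text

    def paras(self):
        # Look the view class up through self so subclasses can replace it.
        return self.View(io.StringIO(self._text)).paras()


class SkipHeadingReader(Reader):
    View = SkipHeadingView


text = "A heading\n\nFirst paragraph.\n\nSecond paragraph.\n"
with_heading = Reader(text).paras()
without_heading = SkipHeadingReader(text).paras()
```

The subclass changes one attribute and inherits everything else, which is the same shape as IgnoreHeadingCorpusReader in corpus.py.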
The IgnoreHeadingCorpusView class is a subclass of StreamBackedCorpusView that does the following on initialization:
1. Opens the file using self._open(). This function is defined by StreamBackedCorpusView, and sets the internal instance variable self._stream to the opened file.
2. Reads one block with read_blankline_block(), which reads the heading as a paragraph and moves the stream's file position forward to the next block.
3. Resets the start file position to the current position of self._stream. The self._filepos variable is an internal index of where each block is in the file.
The following is a diagram illustrating the relationships between the classes:
[Diagram: IgnoreHeadingCorpusReader subclasses PlaintextCorpusReader and sets its CorpusView class variable to IgnoreHeadingCorpusView, which subclasses StreamBackedCorpusView.]
Corpus views can get a lot fancier and more complicated, but the core concept is the same: read blocks from a stream to return a list of tokens. There are a number of block readers provided in nltk.corpus.reader.util, but you can always create your own. If you do want to define your own block reader function, then you have two choices on how to implement it:
1. Define it as a separate function and pass it in to StreamBackedCorpusView as block_reader. This is a good option if your block reader is fairly simple, reusable, and doesn't require any outside variables or configuration.
2. Subclass StreamBackedCorpusView and override the read_block() method. This is what many custom corpus views do because the block reading is highly specialized and requires additional functions and configuration, usually provided by the corpus reader when the corpus view is initialized.
The following is a survey of most of the included block readers in nltk.corpus.reader.util. Unless otherwise mentioned, each block reader function takes a single argument, the stream to read from:
- read_whitespace_block(): This will read 20 lines from the stream, splitting each line into tokens by whitespace.
- read_wordpunct_block(): This reads 20 lines from the stream, splitting each line using nltk.tokenize.wordpunct_tokenize().
- read_line_block(): This reads 20 lines from the stream and returns them as a list, with each line as a token.
- read_blankline_block(): This will read lines from the stream until it finds a blank line. It will then return a single token of all lines found combined into a single string.
- read_regexp_block(): This takes two additional arguments, start_re and end_re, which must be regular expressions that can be passed to re.match(). The start_re variable matches the starting line of a block, and end_re matches the ending line of the block. The end_re variable defaults to None, in which case the block will end as soon as a new start_re match is found. The return value is a single token of all lines in the block joined into a single string.
If you want to have a corpus of pickled objects, you can use the PickleCorpusView, a subclass of StreamBackedCorpusView, found in nltk.corpus.reader.util. A file consists of blocks of pickled objects, and can be created with the PickleCorpusView.write() class method, which takes a sequence of objects and an output file, then pickles each object using pickle.dump() and writes it to the file. It overrides the read_block() method to return a list of unpickled objects from the stream, using pickle.load().
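The pickling round trip described above can be sketched with the standard library alone. The helper names write_pickled and read_pickled_block are hypothetical, the block size of 2 is an arbitrary choice for the demonstration, and an in-memory buffer stands in for the corpus file. This is not the source of PickleCorpusView.

```python
import io
import pickle


def write_pickled(objects, stream):
    """Pickle each object in turn onto the same binary stream."""
    for obj in objects:
        pickle.dump(obj, stream)


def read_pickled_block(stream, block_size=2):
    """Unpickle up to block_size objects; a short or empty list means EOF."""
    block = []
    for _ in range(block_size):
        try:
            block.append(pickle.load(stream))
        except EOFError:
            # pickle.load() raises EOFError when no data is left.
            break
    return block


buffer = io.BytesIO()
write_pickled([("the", "DT"), ("cat", "NN"), ("sat", "VBD")], buffer)
buffer.seek(0)  # rewind before reading back

first_block = read_pickled_block(buffer)
second_block = read_pickled_block(buffer)
third_block = read_pickled_block(buffer)
```

Because each pickle.dump() call writes a self-delimiting record, successive pickle.load() calls recover the objects one by one in the order they were written.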
Also found in nltk.corpus.reader.util is the ConcatenatedCorpusView class. This class is useful if you have multiple files that you want a corpus reader to treat as a single file. A ConcatenatedCorpusView object is created by giving it a list of corpus_views, which are then iterated over as if they were a single view.
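The concatenation idea amounts to chaining several iterables behind one interface. The following is a minimal sketch under a hypothetical name (ConcatenatedView), with plain lists standing in for corpus views. It does not reproduce the behavior of NLTK's class beyond iteration and length.

```python
from itertools import chain


class ConcatenatedView:
    """Toy wrapper that presents several views as one (not NLTK's code)."""

    def __init__(self, views):
        self._views = list(views)

    def __iter__(self):
        # Walk each underlying view in order, one token at a time.
        return chain.from_iterable(self._views)

    def __len__(self):
        return sum(len(view) for view in self._views)


# Plain lists stand in for per-file corpus views here.
combined = ConcatenatedView([["a", "b"], ["c"], []])
all_tokens = list(combined)
total = len(combined)
```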