Creating a custom corpus view

A corpus view is a class wrapper around a corpus file that reads in blocks of tokens as needed. Its purpose is to provide a view into a file without reading the whole file at once, since corpus files can be quite large. If the corpus readers included with NLTK already meet all your needs, you do not have to know anything about corpus views. But if you have a custom file format that needs special handling, this recipe will show you how to create and use a custom corpus view. The main corpus view class is StreamBackedCorpusView, which opens a single file as a stream and maintains an internal cache of the blocks it has read.
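To make the idea of reading blocks on demand concrete, here is a minimal sketch that uses only the standard library. It is not NLTK's implementation: the LazyBlockView class, its block() method, and the lines_per_block parameter are hypothetical names chosen for illustration, and the cache here is a plain dictionary.

```python
import io

class LazyBlockView:
    """Hypothetical stand-in for a corpus view: reads blocks only on demand."""

    def __init__(self, stream, lines_per_block=2):
        self._stream = stream
        self._lines_per_block = lines_per_block
        # Start offset of every block discovered so far (block 0 starts here).
        self._filepos = [stream.tell()]
        self._cache = {}  # block index -> list of tokens

    def _read_block(self):
        """Read one block and return its whitespace-separated tokens."""
        tokens = []
        for _ in range(self._lines_per_block):
            tokens.extend(self._stream.readline().split())
        return tokens

    def block(self, index):
        """Return block `index`, reading forward only as far as needed."""
        if index in self._cache:
            return self._cache[index]
        # Read forward until the start offset of block `index` is known.
        while len(self._filepos) <= index:
            self._stream.seek(self._filepos[-1])
            self._read_block()
            self._filepos.append(self._stream.tell())
        self._stream.seek(self._filepos[index])
        tokens = self._read_block()
        self._cache[index] = tokens
        return tokens

view = LazyBlockView(io.StringIO("a b\nc d\ne f\ng h\n"))
second_block = view.block(1)  # only the first two blocks are touched
```

The list of start offsets plays the same role as the self._filepos index that appears later in this recipe: once a block's position is known, the view can seek straight to it instead of rereading the file.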

Blocks of tokens are read in with a block reader function. A block can be any piece of text, such as a paragraph or a line, and tokens are parts of a block, such as individual words. In the Creating a part-of-speech tagged word corpus recipe, we discussed the default para_block_reader function of the TaggedCorpusReader class, which reads lines from a file until it finds a blank line, then returns those lines as a single paragraph token. The actual block reader function is nltk.corpus.reader.util.read_blankline_block. The TaggedCorpusReader class passes this block reader function into a TaggedCorpusView class whenever it needs to read blocks from a file. The TaggedCorpusView class is a subclass of StreamBackedCorpusView that knows to split paragraphs of word/tag into (word, tag) tuples.
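The two ideas in this paragraph, reading up to a blank line and splitting word/tag items, can be sketched with the standard library alone. This is a simplified illustration, not the code NLTK ships: read_blankline_block_sketch and split_tagged are hypothetical names, and the handling of items without a separator is an assumption.

```python
import io

def read_blankline_block_sketch(stream):
    """Collect lines until a blank line (or end of file) and return them
    as a one-element list holding the joined string."""
    lines = []
    while True:
        line = stream.readline()
        if not line:                      # end of file
            return ["".join(lines)] if lines else []
        if not line.strip():              # blank line
            if lines:                     # ends the block once we have content
                return ["".join(lines)]
        else:
            lines.append(line)

def split_tagged(block, sep="/"):
    """Turn 'word/tag' items into (word, tag) tuples."""
    pairs = []
    for item in block.split():
        if sep in item:
            word, _, tag = item.rpartition(sep)
            pairs.append((word, tag))
        else:
            pairs.append((item, None))    # assumption: untagged items get None
    return pairs

stream = io.StringIO("The/DT dog/NN\nbarks/VBZ\n\nIt/PRP runs/VBZ\n")
first_paragraph = read_blankline_block_sketch(stream)
tagged = split_tagged(first_paragraph[0])
```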

How to do it...

We'll start with the simple case of a plain text file with a heading that should be ignored by the corpus reader. Let's make a file called heading_text.txt that looks like this:

A simple heading

Here is the actual text for the corpus.

Paragraphs are split by blanklines.

This is the 3rd paragraph.
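If you would rather create this file from Python than in a text editor, the following sketch writes the same content. The write_heading_text function and the HEADING_TEXT constant are hypothetical helpers introduced here for convenience.

```python
from pathlib import Path

# The sample corpus: a heading followed by three paragraphs,
# each separated by a blank line.
HEADING_TEXT = (
    "A simple heading\n"
    "\n"
    "Here is the actual text for the corpus.\n"
    "\n"
    "Paragraphs are split by blanklines.\n"
    "\n"
    "This is the 3rd paragraph.\n"
)

def write_heading_text(directory="."):
    """Write the sample corpus file into `directory` and return its path."""
    path = Path(directory) / "heading_text.txt"
    path.write_text(HEADING_TEXT, encoding="utf-8")
    return path
```

Calling write_heading_text() with no arguments places heading_text.txt in the current directory, which is where the rest of this recipe expects to find it.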

Normally, we'd use the PlaintextCorpusReader class, but by default it will treat A simple heading as the first paragraph. To ignore this heading, we need to subclass the PlaintextCorpusReader class so we can override its CorpusView class variable with our own StreamBackedCorpusView subclass. The following is the code found in corpus.py:

from nltk.corpus.reader import PlaintextCorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView

class IgnoreHeadingCorpusView(StreamBackedCorpusView):
  def __init__(self, *args, **kwargs):
    StreamBackedCorpusView.__init__(self, *args, **kwargs)
    # open self._stream
    self._open()
    # skip the heading block
    self.read_block(self._stream)
    # reset the start position to the current position in the stream
    self._filepos = [self._stream.tell()]

class IgnoreHeadingCorpusReader(PlaintextCorpusReader):
  CorpusView = IgnoreHeadingCorpusView

To demonstrate that this works as expected, here is code showing that the default PlaintextCorpusReader class finds four paragraphs, while our IgnoreHeadingCorpusReader class only has three paragraphs:

>>> from nltk.corpus.reader import PlaintextCorpusReader
>>> plain = PlaintextCorpusReader('.', ['heading_text.txt'])
>>> len(plain.paras())
4
>>> from corpus import IgnoreHeadingCorpusReader
>>> reader = IgnoreHeadingCorpusReader('.', ['heading_text.txt'])
>>> len(reader.paras())
3

How it works...

The PlaintextCorpusReader class by design has a CorpusView class variable that can be overridden by subclasses. So we do just that, and make our IgnoreHeadingCorpusView class the CorpusView class variable.

Note

Most corpus readers do not have a CorpusView class variable because they require very specific corpus views.

The IgnoreHeadingCorpusView class is a subclass of StreamBackedCorpusView that does the following on initialization:

  1. Opens the file using self._open(). This function is defined by StreamBackedCorpusView, and sets the internal instance variable self._stream to the opened file.
  2. Reads one block by calling self.read_block(). When paragraphs are requested, blocks are read with read_blankline_block(), so this call consumes the heading as a paragraph and moves the stream's file position forward to the next block.
  3. Resets the start file position to the current position of self._stream. The self._filepos variable is an internal index of where each block is in the file.
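The three steps above can be sketched without NLTK, using an in-memory stream. This is only an illustration of the technique: read_paragraph and paragraphs_after_heading are hypothetical functions, and the real view stores the start position in self._filepos rather than returning it.

```python
import io

def read_paragraph(stream):
    """Read lines up to the next blank line; return [] at end of file."""
    lines = []
    while True:
        line = stream.readline()
        if not line:
            break
        if line.strip():
            lines.append(line)
        elif lines:
            break
    return ["".join(lines)] if lines else []

def paragraphs_after_heading(stream):
    # Step 1: the stream is already open (the caller opened it).
    # Step 2: read and discard one block, which is the heading.
    read_paragraph(stream)
    # Step 3: remember where the real content starts.
    start = stream.tell()
    paragraphs = []
    while True:
        block = read_paragraph(stream)
        if not block:
            break
        paragraphs.extend(block)
    return start, paragraphs

text = (
    "A simple heading\n\n"
    "Here is the actual text for the corpus.\n\n"
    "Paragraphs are split by blanklines.\n\n"
    "This is the 3rd paragraph.\n"
)
start, paragraphs = paragraphs_after_heading(io.StringIO(text))
```

Because the start position is recorded after the heading has been consumed, anything that later seeks back to the "beginning" of the content lands on the first real paragraph.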

The following is a diagram illustrating the relationships between the classes:

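[Diagram not reproduced: IgnoreHeadingCorpusReader subclasses PlaintextCorpusReader and sets its CorpusView class variable to IgnoreHeadingCorpusView, which subclasses StreamBackedCorpusView.]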

There's more...

Corpus views can get a lot fancier and more complicated, but the core concept is the same: read blocks from a stream to return a list of tokens. There are a number of block readers provided in nltk.corpus.reader.util, but you can always create your own. If you do want to define your own block reader function, then you have two choices on how to implement it:

  1. Define it as a separate function and pass it into StreamBackedCorpusView as block_reader. This is a good option if your block reader is fairly simple, reusable, and doesn't require any outside variables or configuration.
  2. Subclass StreamBackedCorpusView and override the read_block() method. This is what many custom corpus views do because the block reading is highly specialized and requires additional functions and configuration, usually provided by the corpus reader when the corpus view is initialized.
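The difference between the two choices is easier to see side by side. The sketch below uses a hypothetical MiniView class as a heavily simplified stand-in for StreamBackedCorpusView so that it runs with the standard library only; read_line_tokens and CsvFieldView are also invented for this example. One simplification to keep in mind: this sketch treats an empty block as end of file.

```python
import io

class MiniView:
    """Hypothetical, heavily simplified stand-in for StreamBackedCorpusView."""

    def __init__(self, stream, block_reader=None):
        self._stream = stream
        if block_reader is not None:
            # Option 1: a plain function takes the place of read_block().
            self.read_block = block_reader

    def read_block(self, stream):
        raise NotImplementedError("pass block_reader or override read_block()")

    def tokens(self):
        """Read blocks until an empty block signals end of file."""
        result = []
        while True:
            block = self.read_block(self._stream)
            if not block:
                return result
            result.extend(block)

def read_line_tokens(stream):
    """Option 1: a standalone block reader, one stripped line per token."""
    line = stream.readline()
    return [line.strip()] if line else []

class CsvFieldView(MiniView):
    """Option 2: a subclass whose read_block() depends on configuration."""

    def __init__(self, stream, delimiter=","):
        super().__init__(stream)
        self._delimiter = delimiter

    def read_block(self, stream):
        line = stream.readline()
        return line.strip().split(self._delimiter) if line else []

lines = MiniView(io.StringIO("a\nb\n"), block_reader=read_line_tokens).tokens()
fields = CsvFieldView(io.StringIO("x;y\nz\n"), delimiter=";").tokens()
```

The function version needs nothing but the stream, so it can be shared between views. The subclass version carries its own configuration (here, the delimiter), which is why it suits more specialized formats.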

Block reader functions

The following is a survey of most of the block readers included in nltk.corpus.reader.util. Unless otherwise mentioned, each block reader function takes a single argument, the stream to read from:

  • read_whitespace_block(): This reads up to 20 lines from the stream, splitting each line into tokens by whitespace.
  • read_wordpunct_block(): This reads up to 20 lines from the stream, splitting each line using nltk.tokenize.wordpunct_tokenize().
  • read_line_block(): This reads up to 20 lines from the stream and returns them as a list, with each line as a token.
  • read_blankline_block(): This reads lines from the stream until it finds a blank line, then returns a single token: all the lines found, combined into one string.
  • read_regexp_block(): This takes two additional arguments, start_re and end_re, which must be regular expressions that can be passed to re.match(). The start_re argument matches the starting line of a block, and end_re matches the ending line of the block. The end_re argument defaults to None, in which case the block ends as soon as a new start_re match is found. The return value is a single token of all lines in the block joined into a single string.
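Of the readers in this list, the regular expression reader has the most moving parts, so here is a simplified sketch of the behavior just described. It is a reimplementation for illustration under stated assumptions, not NLTK's code: the name read_regexp_block_sketch is hypothetical, and in this sketch the line matching end_re is consumed but left out of the returned token.

```python
import io
import re

def read_regexp_block_sketch(stream, start_re, end_re=None):
    """Return one block as a one-element list, or [] at end of file."""
    # Skip lines until one matches start_re.
    while True:
        line = stream.readline()
        if not line:
            return []
        if re.match(start_re, line):
            break
    lines = [line]
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line:
            break                      # end of file ends the block
        if end_re is not None:
            if re.match(end_re, line):
                break                  # end line is consumed, not included
        elif re.match(start_re, line):
            stream.seek(pos)           # leave the next start line unread
            break
        lines.append(line)
    return ["".join(lines)]

stream = io.StringIO("# one\na\nb\n# two\nc\n")
first = read_regexp_block_sketch(stream, r"#")
second = read_regexp_block_sketch(stream, r"#")
```

Note the seek back to pos when end_re is None: the line that starts the next block has already been read, so the stream has to be rewound for the next call to see it.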

Pickle corpus view

If you want to have a corpus of pickled objects, you can use PickleCorpusView, a subclass of StreamBackedCorpusView found in nltk.corpus.reader.util. A file consists of blocks of pickled objects, and can be created with the PickleCorpusView.write() class method, which takes a sequence of objects and an output file, then pickles each object using pickle.dump() and writes it to the file. The view overrides the read_block() method to return a list of unpickled objects from the stream, using pickle.load().
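The underlying file format, one pickle written directly after another, can be tried out with the standard library. The sketch below is not the PickleCorpusView code: write_pickles and read_pickle_block are hypothetical functions, and the block_size parameter and its default are assumptions made for this example.

```python
import io
import pickle

def write_pickles(sequence, output_file):
    """Pickle each object in turn, one after another, into output_file."""
    for obj in sequence:
        pickle.dump(obj, output_file)

def read_pickle_block(stream, block_size=10):
    """Unpickle up to block_size objects; return [] at end of file."""
    objects = []
    for _ in range(block_size):
        try:
            objects.append(pickle.load(stream))
        except EOFError:               # no more pickles in the stream
            break
    return objects

buffer = io.BytesIO()                  # stands in for a file opened in binary mode
write_pickles([("the", "DT"), {"count": 1}, 3], buffer)
buffer.seek(0)
first_block = read_pickle_block(buffer, block_size=2)
rest = read_pickle_block(buffer)
```

Each call to pickle.load() consumes exactly one object, which is what allows the objects to be read back in blocks rather than all at once.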

Concatenated corpus view

Also found in nltk.corpus.reader.util is the ConcatenatedCorpusView class. This class is useful if you have multiple files that you want a corpus reader to treat as a single file. A ConcatenatedCorpusView instance is created by giving it a list of corpus_views, which are then iterated over as if they were a single view.
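The idea of treating several views as one can be sketched in a few lines. The ConcatenatedSketch class below is a hypothetical stand-in that works on any sequences (plain lists here), not the ConcatenatedCorpusView implementation.

```python
from itertools import chain

class ConcatenatedSketch:
    """Hypothetical stand-in: present several sequences as a single one."""

    def __init__(self, views):
        self._views = list(views)

    def __len__(self):
        return sum(len(view) for view in self._views)

    def __iter__(self):
        # Walk each underlying view in order, without copying them.
        return chain.from_iterable(self._views)

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if index < 0:
            raise IndexError("index out of range")
        # Find the view that holds this index.
        for view in self._views:
            if index < len(view):
                return view[index]
            index -= len(view)
        raise IndexError("index out of range")

combined = ConcatenatedSketch([["a", "b"], ["c"], []])
all_tokens = list(combined)
```

Nothing is copied when the views are combined; iteration and indexing are delegated to the underlying views, so each one can keep reading its own file lazily.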

See also

The concept of block readers was introduced in the Creating a part-of-speech tagged word corpus recipe.
