Creating a MongoDB-backed corpus reader

All the corpus readers we've dealt with so far have been file-based. That is partly due to the design of the CorpusReader base class, and partly because most corpus data is assumed to live in text files. However, sometimes you'll have data stored in a database that you want to access and use just like a text file corpus. In this recipe, we'll cover the case where you have documents in MongoDB, and you want to use a particular field of each document as your block of text.

Getting ready

MongoDB is a document-oriented database that has become a popular alternative to relational databases such as MySQL. The installation and setup of MongoDB is outside the scope of this book, but you can find instructions at http://docs.mongodb.org/manual/.

You'll also need to install PyMongo, a Python driver for MongoDB. You should be able to do this with either easy_install or pip, by typing sudo easy_install pymongo or sudo pip install pymongo.

The following code assumes that your database is on localhost port 27017, which is the MongoDB default configuration, and that you'll be using the test database with a collection named corpus that contains documents with a text field. Explanations for these arguments are available in the PyMongo documentation at http://api.mongodb.org/python/current/.
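
If you don't already have suitable data, you can load a few sample documents into the test database using PyMongo. This is just a hypothetical setup step; the document contents below are placeholders, and any text will do:

>>> import pymongo
>>> conn = pymongo.MongoClient('localhost', 27017)
>>> result = conn['test']['corpus'].insert_many([
...   {'text': 'Hello world. This is the first document.'},
...   {'text': 'Here is a second document for the corpus.'}])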

How to do it...

Since the CorpusReader class assumes you have a file-based corpus, we can't directly subclass it. Instead, we're going to emulate both the StreamBackedCorpusView and PlaintextCorpusReader classes. The StreamBackedCorpusView class is a subclass of nltk.util.AbstractLazySequence, so we'll subclass AbstractLazySequence to create a MongoDB view, and then create a new class that will use the view to provide functionality similar to the PlaintextCorpusReader class. The following is the code, which is found in mongoreader.py:

import pymongo
from nltk.data import LazyLoader
from nltk.tokenize import TreebankWordTokenizer
from nltk.util import AbstractLazySequence, LazyMap, LazyConcatenation

class MongoDBLazySequence(AbstractLazySequence):
  def __init__(self, host='localhost', port=27017, db='test', collection='corpus', field='text'):
    self.conn = pymongo.MongoClient(host, port)
    self.collection = self.conn[db][collection]
    self.field = field

  def __len__(self):
    # count_documents() replaces the Collection.count() method that was
    # deprecated and later removed from PyMongo
    return self.collection.count_documents({})

  def iterate_from(self, start):
    # Pull back only the field we care about, skipping the first start
    # documents so iteration can resume from any position. The projection
    # keyword replaces the fields keyword of older PyMongo releases.
    f = lambda d: d.get(self.field, '')
    return iter(LazyMap(f, self.collection.find(projection=[self.field], skip=start)))

class MongoDBCorpusReader(object):
  def __init__(self, word_tokenizer=TreebankWordTokenizer(),
               sent_tokenizer=LazyLoader('tokenizers/punkt/PY3/english.pickle'),
               **kwargs):
    # Any extra keyword arguments (host, port, db, collection, field) are
    # passed through to MongoDBLazySequence
    self._seq = MongoDBLazySequence(**kwargs)
    self._word_tokenize = word_tokenizer.tokenize
    self._sent_tokenize = sent_tokenizer.tokenize

  def text(self):
    # A lazy sequence of the raw text field from each document
    return self._seq

  def words(self):
    # Tokenize each text lazily, then flatten into a single list of words
    return LazyConcatenation(LazyMap(self._word_tokenize, self.text()))

  def sents(self):
    # Tokenize each text into sentences lazily, then flatten
    return LazyConcatenation(LazyMap(self._sent_tokenize, self.text()))
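
Assuming the test database's corpus collection already contains documents with a text field, you can construct the reader with its defaults and get lazy word and sentence lists; the actual contents will, of course, depend on your data:

>>> from mongoreader import MongoDBCorpusReader
>>> reader = MongoDBCorpusReader()
>>> words = reader.words()
>>> sents = reader.sents()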

How it works...

The AbstractLazySequence class is an abstract class that provides read-only, on-demand iteration. Subclasses must implement the __len__() and iterate_from(start) methods, and the base class provides the rest of the list and iterator emulation methods. By creating the MongoDBLazySequence subclass as our view, we can iterate over documents in the MongoDB collection on demand, without keeping all of them in memory. The LazyMap class is a lazy version of Python's built-in map() function, and is used in iterate_from() to extract the field we're interested in from each document. Like our view, it's a subclass of AbstractLazySequence.
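
As a quick illustration of this lazy behavior, LazyMap only applies its function to the elements you actually request:

>>> from nltk.util import LazyMap
>>> squares = LazyMap(lambda x: x * x, [1, 2, 3, 4])
>>> squares[2]
9
>>> list(squares)
[1, 4, 9, 16]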

The MongoDBCorpusReader class creates an internal instance of MongoDBLazySequence for iteration, then defines the word and sentence tokenization methods. The text() method simply returns the instance of MongoDBLazySequence, which results in a lazily evaluated list of each text field. The words() method uses LazyMap and LazyConcatenation to return a lazily evaluated list of all words, while the sents() method does the same for sentences. The sent_tokenizer is loaded on demand with LazyLoader, which is a wrapper around nltk.data.load(), analogous to LazyCorpusLoader. The LazyConcatenation class is a subclass of AbstractLazySequence too, and produces a flat list from a given list of lists (each of which may also be lazy). In our case, we're concatenating the results of LazyMap to ensure we don't return nested lists.
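
For example, LazyConcatenation turns a lazy list of lists into a single flat sequence, which is exactly what happens when each text field is tokenized into a list of words:

>>> from nltk.util import LazyMap, LazyConcatenation
>>> texts = ['a b c', 'd e']
>>> list(LazyMap(str.split, texts))
[['a', 'b', 'c'], ['d', 'e']]
>>> list(LazyConcatenation(LazyMap(str.split, texts)))
['a', 'b', 'c', 'd', 'e']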

There's more...

All of the parameters are configurable. For example, if you had a database named website, with a collection named comments, whose documents had a field called comment, you could create a MongoDBCorpusReader instance as follows:

>>> reader = MongoDBCorpusReader(db='website', collection='comments', field='comment')

You can also pass in custom instances for word_tokenizer and sent_tokenizer, as long as the objects implement the nltk.tokenize.TokenizerI interface by providing a tokenize(text) method.
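
For example, here's a sketch that combines other NLTK tokenizers with the hypothetical database, collection, and field names from the previous example:

>>> from nltk.tokenize import RegexpTokenizer, PunktSentenceTokenizer
>>> reader = MongoDBCorpusReader(word_tokenizer=RegexpTokenizer(r"[\w']+"),
...   sent_tokenizer=PunktSentenceTokenizer(),
...   db='website', collection='comments', field='comment')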

See also

Corpus views were covered in the previous recipe, and tokenization was covered in Chapter 1, Tokenizing Text and WordNet Basics.
