All the corpus readers we've dealt with so far have been file-based. That is partly due to the design of the CorpusReader base class, and partly due to the assumption that most corpus data will be in text files. However, sometimes you'll have data stored in a database that you want to access and use just like a text file corpus. In this recipe, we'll cover the case where you have documents in MongoDB, and you want to use a particular field of each document as your block of text.
MongoDB is a document-oriented database that has become a popular alternative to relational databases such as MySQL. The installation and setup of MongoDB is outside the scope of this book, but you can find instructions at http://docs.mongodb.org/manual/.
You'll also need to install PyMongo, a Python driver for MongoDB. You should be able to do this with either easy_install or pip, by typing sudo easy_install pymongo or sudo pip install pymongo.
The following code assumes that your database is on localhost port 27017, which is the MongoDB default configuration, and that you'll be using the test database with a collection named corpus that contains documents with a text field. Explanations for these arguments are available in the PyMongo documentation at http://api.mongodb.org/python/current/.
Since the CorpusReader class assumes you have a file-based corpus, we can't directly subclass it. Instead, we're going to emulate both the StreamBackedCorpusView and PlaintextCorpusReader classes. The StreamBackedCorpusView class is a subclass of nltk.util.AbstractLazySequence, so we'll subclass AbstractLazySequence to create a MongoDB view, and then create a new class that will use the view to provide functionality similar to the PlaintextCorpusReader class. The following is the code, which is found in mongoreader.py:
import pymongo
from nltk.data import LazyLoader
from nltk.tokenize import TreebankWordTokenizer
from nltk.util import AbstractLazySequence, LazyMap, LazyConcatenation

class MongoDBLazySequence(AbstractLazySequence):
    def __init__(self, host='localhost', port=27017, db='test',
                 collection='corpus', field='text'):
        self.conn = pymongo.MongoClient(host, port)
        self.collection = self.conn[db][collection]
        self.field = field

    def __len__(self):
        # Note: in PyMongo 3.7+, use self.collection.count_documents({}) instead
        return self.collection.count()

    def iterate_from(self, start):
        f = lambda d: d.get(self.field, '')
        # Note: in PyMongo 3+, pass projection=[self.field] instead of fields=[self.field]
        return iter(LazyMap(f, self.collection.find(fields=[self.field], skip=start)))

class MongoDBCorpusReader(object):
    def __init__(self, word_tokenizer=TreebankWordTokenizer(),
                 sent_tokenizer=LazyLoader('tokenizers/punkt/PY3/english.pickle'),
                 **kwargs):
        self._seq = MongoDBLazySequence(**kwargs)
        self._word_tokenize = word_tokenizer.tokenize
        self._sent_tokenize = sent_tokenizer.tokenize

    def text(self):
        return self._seq

    def words(self):
        return LazyConcatenation(LazyMap(self._word_tokenize, self.text()))

    def sents(self):
        return LazyConcatenation(LazyMap(self._sent_tokenize, self.text()))
The AbstractLazySequence class is an abstract class that provides read-only, on-demand iteration. Subclasses must implement the __len__() and iterate_from(start) methods, while the base class provides the rest of the list and iterator emulation methods. By creating the MongoDBLazySequence subclass as our view, we can iterate over the documents in the MongoDB collection on demand, without keeping all of them in memory. The LazyMap class is a lazy version of Python's built-in map() function, and is used in iterate_from() to transform each document into the specific field that we're interested in. It's also a subclass of AbstractLazySequence.
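To see the AbstractLazySequence contract in isolation, here is a toy subclass backed by a plain in-memory list, a hypothetical stand-in for the MongoDB collection. Only __len__() and iterate_from() need to be defined; the base class supplies indexing, slicing, and iteration:

```python
from nltk.util import AbstractLazySequence

# A toy lazy sequence backed by an in-memory list, standing in for
# the MongoDB collection; only __len__ and iterate_from are required.
class ListBackedLazySequence(AbstractLazySequence):
    def __init__(self, data):
        self._data = data

    def __len__(self):
        return len(self._data)

    def iterate_from(self, start):
        return iter(self._data[start:])

seq = ListBackedLazySequence(['a', 'b', 'c'])
print(seq[1])     # 'b' -- indexing is provided by the base class
print(list(seq))  # ['a', 'b', 'c']
```

In the recipe's MongoDBLazySequence, iterate_from() simply delegates to a skipping MongoDB query instead of slicing a list.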
The MongoDBCorpusReader class creates an internal instance of MongoDBLazySequence for iteration, then defines the word and sentence tokenization methods. The text() method simply returns the instance of MongoDBLazySequence, which results in a lazily evaluated list of each text field. The words() method uses LazyMap and LazyConcatenation to return a lazily evaluated list of all words, while the sents() method does the same for sentences. The sent_tokenizer is loaded on demand with LazyLoader, which is a wrapper around nltk.data.load(), analogous to LazyCorpusLoader. The LazyConcatenation class is a subclass of AbstractLazySequence too, and produces a flat list from a given list of lists (each of which may also be lazy). In our case, we're concatenating the results of LazyMap to ensure we don't return nested lists.
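These lazy pieces can be exercised without MongoDB at all. The following sketch uses a plain Python list of strings in place of the documents' text fields (and str.split in place of a real tokenizer) to show how LazyMap and LazyConcatenation combine into a flat word list, just as words() does:

```python
from nltk.util import LazyMap, LazyConcatenation

# Hypothetical stand-ins for the documents' text fields
texts = ['hello world', 'lazy evaluation rocks']

# LazyMap tokenizes each text on demand, like a lazy map()
tokenized = LazyMap(str.split, texts)

# LazyConcatenation flattens the list of token lists into one lazy word list
words = LazyConcatenation(tokenized)

print(tokenized[0])   # ['hello', 'world']
print(list(words))    # ['hello', 'world', 'lazy', 'evaluation', 'rocks']
```

Nothing is tokenized until an element is actually requested, which is what lets the corpus reader scale to collections that don't fit in memory.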
All of the parameters are configurable. For example, if you had a db named website, with a collection named comments, whose documents had a field called comment, you could create a MongoDBCorpusReader instance as follows:
>>> reader = MongoDBCorpusReader(db='website', collection='comments', field='comment')
You can also pass in custom instances for word_tokenizer and sent_tokenizer, as long as the objects implement the nltk.tokenize.TokenizerI interface by providing a tokenize(text) method.
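For instance, a minimal custom tokenizer need only subclass TokenizerI and define tokenize(). The CommaTokenizer below is a made-up example that splits text on commas:

```python
from nltk.tokenize.api import TokenizerI

# A hypothetical tokenizer that splits text on commas; any object
# providing a tokenize(text) method satisfies the interface.
class CommaTokenizer(TokenizerI):
    def tokenize(self, text):
        return [t.strip() for t in text.split(',')]

tok = CommaTokenizer()
print(tok.tokenize('one, two, three'))  # ['one', 'two', 'three']
```

You could then pass word_tokenizer=CommaTokenizer() to the MongoDBCorpusReader constructor in place of the default TreebankWordTokenizer.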
Corpus views were covered in the previous recipe, and tokenization was covered in Chapter 1, Tokenizing Text and WordNet Basics.