Loading a corpus reader can be an expensive operation due to the number of files, file sizes, and various initialization tasks. And while you'll often want to specify a corpus reader in a common module, you don't always need to access it right away. To speed up module import time when a corpus reader is defined, NLTK provides a LazyCorpusLoader class that can transform itself into your actual corpus reader as soon as you need it. This way, you can define a corpus reader in a common module without it slowing down module loading.
The LazyCorpusLoader class requires two arguments, the name of the corpus and the corpus reader class, followed by any other arguments needed to initialize the corpus reader class.
The name argument specifies the root directory name of the corpus, which must be within a corpora subdirectory of one of the paths in nltk.data.path. See the Setting up a custom corpus recipe in this chapter for more details on nltk.data.path.
For example, if you have a custom corpus named cookbook in your local nltk_data directory, its path would be ~/nltk_data/corpora/cookbook. You'd then pass 'cookbook' to LazyCorpusLoader as the name, and LazyCorpusLoader will look in ~/nltk_data/corpora for a directory named 'cookbook'.
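The lookup described above can be sketched with the standard library alone. This is only an illustration of the search order, not NLTK's implementation: find_corpus_root is a hypothetical helper, and the temporary directories stand in for the entries of nltk.data.path. NLTK's own lookup goes through nltk.data.find, which also handles zipped corpora.

```python
import os
import tempfile


def find_corpus_root(name, search_paths, subdir="corpora"):
    """Return the first existing <path>/<subdir>/<name> directory.

    Hypothetical helper for illustration only; it is not part of NLTK.
    """
    for base in search_paths:
        candidate = os.path.join(os.path.expanduser(base), subdir, name)
        if os.path.isdir(candidate):
            return candidate
    raise LookupError("corpus %r not found in %r" % (name, search_paths))


# Build a throwaway data directory so the example runs anywhere.
data_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(data_dir, "corpora", "cookbook"))

# An empty directory comes first to show that the search falls through
# to the next entry, much like a list of data paths would.
empty_dir = tempfile.mkdtemp()
root = find_corpus_root("cookbook", [empty_dir, data_dir])
print(root)
```

The name passed in is only the last path component; the corpora subdirectory and the list of base paths supply the rest.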
The second argument to LazyCorpusLoader is reader_cls, which should be a subclass of CorpusReader, such as WordListCorpusReader. You will also need to pass in any other arguments that reader_cls requires for initialization. This is demonstrated in the following code, using the same wordlist file we created in the earlier recipe, Creating a wordlist corpus. The third argument to LazyCorpusLoader is the list of filenames and fileids that will be passed to WordListCorpusReader at initialization:
>>> from nltk.corpus.util import LazyCorpusLoader
>>> from nltk.corpus.reader import WordListCorpusReader
>>> reader = LazyCorpusLoader('cookbook', WordListCorpusReader, ['wordlist'])
>>> isinstance(reader, LazyCorpusLoader)
True
>>> reader.fileids()
['wordlist']
>>> isinstance(reader, LazyCorpusLoader)
False
>>> isinstance(reader, WordListCorpusReader)
True
The LazyCorpusLoader class stores all the arguments given, but otherwise does nothing until you try to access an attribute or method. This way, initialization is very fast, eliminating the overhead of loading the corpus reader immediately. As soon as you do access an attribute or method, it does the following:
1. It calls nltk.data.find('corpora/%s' % name) to find the corpus data root directory.
2. It instantiates the corpus reader class with the root directory and the other stored arguments.
3. It transforms itself into that corpus reader.
So in the previous example code, before we call reader.fileids(), reader is an instance of LazyCorpusLoader, but after the call, reader becomes an instance of WordListCorpusReader.
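To make the "transforms itself" behavior concrete, here is a minimal sketch of the general technique in plain Python. It is a simplified stand-in, not NLTK's code: LazyLoader and FakeWordListReader are hypothetical names, the directory lookup is faked with a string, and error handling is left out. The sketch assumes the trick is to take over the real object's __dict__ and __class__ on first attribute access.

```python
class FakeWordListReader(object):
    """Hypothetical reader that just records its arguments."""

    def __init__(self, root, fileids, encoding="utf8"):
        self.root = root
        self._fileids = list(fileids)
        self.encoding = encoding

    def fileids(self):
        return self._fileids


class LazyLoader(object):
    """Simplified illustration of a self-replacing lazy proxy."""

    def __init__(self, name, reader_cls, *args, **kwargs):
        # Only remember what is needed to build the reader later.
        self._name = name
        self._reader_cls = reader_cls
        self._args = args
        self._kwargs = kwargs

    def _load(self):
        # A real loader would search the data paths here; the root is faked.
        root = "corpora/%s" % self._name
        # Build the real reader with the stored arguments.
        reader = self._reader_cls(root, *self._args, **self._kwargs)
        # Become the reader by taking over its state and its class.
        self.__dict__ = reader.__dict__
        self.__class__ = reader.__class__

    def __getattr__(self, attr):
        # Only called when normal lookup fails, which is the case for
        # reader attributes before loading. Dunder lookups are refused
        # so that copying or pickling cannot trigger a load by accident.
        if attr.startswith("__"):
            raise AttributeError(attr)
        self._load()
        return getattr(self, attr)


reader = LazyLoader("cookbook", FakeWordListReader, ["wordlist"], encoding="ascii")
before = type(reader).__name__
ids = reader.fileids()          # first access triggers the load
after = type(reader).__name__
print(before, ids, after)       # LazyLoader ['wordlist'] FakeWordListReader
```

Because the object swaps its own class, every existing reference to reader sees the loaded reader afterward, and the extra positional and keyword arguments reach the reader's constructor unchanged.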
All of the corpora included with NLTK and defined in nltk.corpus are initially instances of LazyCorpusLoader. The following is some code from nltk.corpus defining the treebank corpora:
treebank = LazyCorpusLoader('treebank/combined', BracketParseCorpusReader,
    r'wsj_.*\.mrg', tagset='wsj', encoding='ascii')
treebank_chunk = LazyCorpusLoader('treebank/tagged', ChunkedCorpusReader,
    r'wsj_.*\.pos',
    sent_tokenizer=RegexpTokenizer(r'(?<=/\.)\s*(?![^\[]*\])', gaps=True),
    para_block_reader=tagged_treebank_para_block_reader, encoding='ascii')
treebank_raw = LazyCorpusLoader('treebank/raw', PlaintextCorpusReader,
    r'wsj_.*', encoding='ISO-8859-2')
As you can see in the previous code, any number of additional positional and keyword arguments can be passed through LazyCorpusLoader to the reader_cls class when it is initialized.
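One of those pass-through arguments is the fileids pattern. When it is given as a string rather than a list, it acts as a regular expression over file names. The sketch below uses only the re module to show what a pattern such as wsj_.*\.mrg selects and why the escaped dot matters. The file names are made up for illustration, and re.fullmatch is an assumption standing in for however the reader applies the pattern.

```python
import re

# Hypothetical file names; not taken from the real treebank corpus listing.
candidates = ["wsj_0001.mrg", "wsj_0002.mrg", "wsj_0001.pos", "wsj_0001xmrg", "README"]


def select(pattern, names):
    """Return the names that the pattern matches in full."""
    return [n for n in names if re.fullmatch(pattern, n)]


# Escaped dot: only names that really end in ".mrg" are selected.
mrg_files = select(r"wsj_.*\.mrg", candidates)

# Unescaped dot: "." matches any character, so "wsj_0001xmrg" slips in.
loose = select(r"wsj_.*.mrg", candidates)

print(mrg_files)
print(loose)
```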