The WordListCorpusReader class is one of the simplest CorpusReader classes. It provides access to a file containing a list of words, one word per line. In fact, you've already used it: the stopwords corpus we used in Chapter 1, Tokenizing Text and WordNet Basics, in the Filtering stopwords in a tokenized sentence and Discovering word collocations recipes, is read by this class.
We need to start by creating a wordlist file. This could be a single-column CSV file, or just a normal text file with one word per line. Let's create a file named wordlist that looks like this:

nltk
corpus
corpora
wordnet
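You can create this file in any text editor, or write it from Python. A quick sketch using only the standard library (the word list matches the file contents above):

```python
from pathlib import Path

# Write one word per line, ending with a trailing newline,
# to a file named 'wordlist' in the current directory.
wordlist_words = ['nltk', 'corpus', 'corpora', 'wordnet']
Path('wordlist').write_text('\n'.join(wordlist_words) + '\n')

print(Path('wordlist').read_text())
```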
Now we can instantiate a WordListCorpusReader that will produce a list of words from our file. It takes two arguments: the directory path containing the files, and a list of filenames. If you open the Python console in the same directory as the files, then '.' can be used as the directory path. Otherwise, you must use a directory path such as nltk_data/corpora/cookbook:
>>> from nltk.corpus.reader import WordListCorpusReader
>>> reader = WordListCorpusReader('.', ['wordlist'])
>>> reader.words()
['nltk', 'corpus', 'corpora', 'wordnet']
>>> reader.fileids()
['wordlist']
The WordListCorpusReader class inherits from CorpusReader, which is a common base class for all corpus readers. The CorpusReader class does all the work of identifying which files to read, while WordListCorpusReader reads the files and tokenizes each line to produce a list of words. The inheritance is simple:

CorpusReader
    └── WordListCorpusReader
When you call the words() function, it calls nltk.tokenize.line_tokenize() on the raw file data, which you can access using the raw() function as follows:
>>> reader.raw()
'nltk\ncorpus\ncorpora\nwordnet\n'
>>> from nltk.tokenize import line_tokenize
>>> line_tokenize(reader.raw())
['nltk', 'corpus', 'corpora', 'wordnet']
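By default, line_tokenize() splits the raw text on line boundaries and discards blank lines. To make the behavior concrete without NLTK, here is a rough plain-Python sketch (illustrative only; this is not NLTK's actual implementation):

```python
def line_tokenize_sketch(text):
    """Rough stand-in for nltk.tokenize.line_tokenize():
    split on line boundaries and drop blank lines."""
    return [line for line in text.splitlines() if line.strip()]

raw = 'nltk\ncorpus\ncorpora\nwordnet\n'
print(line_tokenize_sketch(raw))  # ['nltk', 'corpus', 'corpora', 'wordnet']
```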
The stopwords corpus is a good example of a multifile WordListCorpusReader. In the Filtering stopwords in a tokenized sentence recipe in Chapter 1, Tokenizing Text and WordNet Basics, we saw that it had one wordlist file for each language, and you could access the words for that language by calling stopwords.words(fileid). If you want to create your own multifile wordlist corpus, this is a great example to follow.
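Following that example, a minimal sketch of a multifile wordlist corpus: create one file per category in a directory, then pass all the filenames to WordListCorpusReader. The directory name mywords and the filenames english and spanish are made up for illustration, and the word lists here are just tiny samples:

```python
import os
from nltk.corpus.reader import WordListCorpusReader

# Hypothetical two-file layout, mimicking the stopwords corpus.
os.makedirs('mywords', exist_ok=True)
with open('mywords/english', 'w') as f:
    f.write('the\nand\nof\n')
with open('mywords/spanish', 'w') as f:
    f.write('el\ny\nde\n')

reader = WordListCorpusReader('mywords', ['english', 'spanish'])
print(reader.fileids())         # ['english', 'spanish']
print(reader.words('english'))  # ['the', 'and', 'of']
print(reader.words('spanish'))  # ['el', 'y', 'de']
```

Passing a fileid to words() restricts the result to that file, just as stopwords.words(fileid) does.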
Another wordlist corpus that comes with NLTK is the names corpus, shown in the following code. It contains two files, female.txt and male.txt, each containing a list of a few thousand common first names organized by gender:
>>> from nltk.corpus import names
>>> names.fileids()
['female.txt', 'male.txt']
>>> len(names.words('female.txt'))
5001
>>> len(names.words('male.txt'))
2943
NLTK also comes with a large list of English words. There's one file with 850 basic words, and another list with over 200,000 known English words, as shown in the following code:
>>> from nltk.corpus import words
>>> words.fileids()
['en', 'en-basic']
>>> len(words.words('en-basic'))
850
>>> len(words.words('en'))
234936
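A common use of the words corpus is a crude spelling check via set membership. The sketch below uses a tiny stand-in vocabulary so it runs without any corpus downloads; with the NLTK data installed you would build the set from words.words('en') instead:

```python
# Stand-in vocabulary; in practice: english = set(words.words('en'))
english = {'corpus', 'word', 'list', 'reader'}

def unknown_words(tokens, vocabulary):
    """Return tokens not found in the vocabulary (case-insensitive)."""
    return [t for t in tokens if t.lower() not in vocabulary]

print(unknown_words(['Corpus', 'wrod', 'list'], english))  # ['wrod']
```

Using a set rather than a list matters here: membership tests against the full 234,936-word list would be much slower.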
The Filtering stopwords in a tokenized sentence recipe in Chapter 1, Tokenizing Text and WordNet Basics, has more details on using the stopwords corpus. In the following recipes, we'll cover more advanced corpus file formats and corpus reader classes.