Creating a wordlist corpus

The WordListCorpusReader class is one of the simplest CorpusReader classes. It provides access to a file containing a list of words, one word per line. In fact, you've already used it indirectly: the stopwords corpus we used in Chapter 1, Tokenizing Text and WordNet Basics, in the Filtering stopwords in a tokenized sentence and Discovering word collocations recipes, is a wordlist corpus.

Getting ready

We need to start by creating a wordlist file. This could be a single-column CSV file, or just a normal text file with one word per line. Let's create a file named wordlist that looks like this:

nltk
corpus
corpora
wordnet
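
You can create the file in any text editor, or write it from Python. Here's a minimal sketch (note that on Python 3, write() echoes the number of characters written):

>>> with open('wordlist', 'w') as f:
...     f.write('nltk\ncorpus\ncorpora\nwordnet\n')
...
28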

How to do it...

Now we can create a WordListCorpusReader instance that will produce a list of words from our file. It takes two arguments: the directory path containing the files, and a list of filenames. If you open the Python console in the same directory as the files, then '.' can be used as the directory path. Otherwise, you must use a directory path such as nltk_data/corpora/cookbook:

>>> from nltk.corpus.reader import WordListCorpusReader
>>> reader = WordListCorpusReader('.', ['wordlist'])
>>> reader.words()
['nltk', 'corpus', 'corpora', 'wordnet']
>>> reader.fileids()
['wordlist']

How it works...

The WordListCorpusReader class inherits from CorpusReader, which is a common base class for all corpus readers. The CorpusReader class does all the work of identifying which files to read, while WordListCorpusReader reads the files and tokenizes each line to produce a list of words. The following is an inheritance diagram:

[Inheritance diagram: WordListCorpusReader inherits from CorpusReader]
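
You can verify this relationship at the console:

>>> from nltk.corpus.reader import WordListCorpusReader, CorpusReader
>>> issubclass(WordListCorpusReader, CorpusReader)
True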

When you call the words() method, it calls nltk.tokenize.line_tokenize() on the raw file data, which you can access using the raw() method as follows:

>>> reader.raw()
'nltk\ncorpus\ncorpora\nwordnet\n'
>>> from nltk.tokenize import line_tokenize
>>> line_tokenize(reader.raw())
['nltk', 'corpus', 'corpora', 'wordnet']

There's more...

The stopwords corpus is a good example of a multifile WordListCorpusReader. In the Filtering stopwords in a tokenized sentence recipe in Chapter 1, Tokenizing Text and WordNet Basics, we saw that it had one wordlist file for each language, and you could access the words for that language by calling stopwords.words(fileid). If you want to create your own multifile wordlist corpus, this is a great example to follow.
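
For instance, suppose you have two hypothetical files in the current directory, english containing the words the and a, and spanish containing el and la, one word per line. A sketch of reading them as a single multifile corpus:

>>> from nltk.corpus.reader import WordListCorpusReader
>>> reader = WordListCorpusReader('.', ['english', 'spanish'])
>>> reader.fileids()
['english', 'spanish']
>>> reader.words('spanish')
['el', 'la']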

Names wordlist corpus

Another wordlist corpus that comes with NLTK is the names corpus. It contains two files, female.txt and male.txt, each with a list of a few thousand common first names organized by gender:

>>> from nltk.corpus import names
>>> names.fileids()
['female.txt', 'male.txt']
>>> len(names.words('female.txt'))
5001
>>> len(names.words('male.txt'))
2943
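
Because words() returns a list, a simple membership check tells you whether a given first name appears in a file, as in this short sketch (converting to a set first speeds up repeated lookups):

>>> 'John' in set(names.words('male.txt'))
True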

English words corpus

NLTK also comes with a large list of English words: one file with 850 basic words, and another with over 200,000 known English words, as shown in the following code:

>>> from nltk.corpus import words
>>> words.fileids()
['en', 'en-basic']
>>> len(words.words('en-basic'))
850
>>> len(words.words('en'))
234936
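
One common use for the en list is checking whether a token is a known English word, for example when filtering out misspellings. A minimal sketch (the set() conversion makes repeated lookups fast; the example tokens are arbitrary):

>>> english_words = set(words.words('en'))
>>> 'language' in english_words
True
>>> 'qzxwv' in english_words
False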

See also

The Filtering stopwords in a tokenized sentence recipe in Chapter 1, Tokenizing Text and WordNet Basics, has more details on using the stopwords corpus. In the following recipes, we'll cover more advanced corpus file formats and corpus reader classes.
