Creating a wordlist corpus

The WordListCorpusReader class is one of the simplest CorpusReader classes. It provides access to a file containing a list of words, one word per line. In fact, you've already used it indirectly: the stopwords corpus we used in Chapter 1, Tokenizing Text and WordNet Basics, in the Filtering stopwords in a tokenized sentence and Discovering word collocations recipes, is a wordlist corpus.

Getting ready

We need to start by creating a wordlist file. This could be a single-column CSV file, or just a normal text file with one word per line. Let's create a file named wordlist that looks like this:

nltk
corpus
corpora
wordnet
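
You can create the file in any text editor, or write it from Python. Here's a minimal sketch (note that on Python 3, write() echoes the number of characters written):

>>> with open('wordlist', 'w') as f:
...     f.write('nltk\ncorpus\ncorpora\nwordnet\n')
...
28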

How to do it...

Now we can create a WordListCorpusReader instance that will produce a list of words from our file. It takes two arguments: the directory path containing the files, and a list of filenames. If you open the Python console in the same directory as the files, then '.' can be used as the directory path. Otherwise, you must use a directory path such as nltk_data/corpora/cookbook:

>>> from nltk.corpus.reader import WordListCorpusReader
>>> reader = WordListCorpusReader('.', ['wordlist'])
>>> reader.words()
['nltk', 'corpus', 'corpora', 'wordnet']
>>> reader.fileids()
['wordlist']

How it works...

The WordListCorpusReader class inherits from CorpusReader, which is a common base class for all corpus readers. The CorpusReader class does all the work of identifying which files to read, while WordListCorpusReader reads the files and tokenizes each line to produce a list of words. The following is an inheritance diagram:

[Inheritance diagram: WordListCorpusReader inherits from CorpusReader]
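
You can verify this relationship at the console:

>>> from nltk.corpus.reader import WordListCorpusReader, CorpusReader
>>> issubclass(WordListCorpusReader, CorpusReader)
True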

When you call the words() method, it calls nltk.tokenize.line_tokenize() on the raw file data, which you can access using the raw() method as follows:

>>> reader.raw()
'nltk\ncorpus\ncorpora\nwordnet\n'
>>> from nltk.tokenize import line_tokenize
>>> line_tokenize(reader.raw())
['nltk', 'corpus', 'corpora', 'wordnet']

There's more...

The stopwords corpus is a good example of a multifile WordListCorpusReader. In the Filtering stopwords in a tokenized sentence recipe in Chapter 1, Tokenizing Text and WordNet Basics, we saw that it had one wordlist file for each language, and you could access the words for that language by calling stopwords.words(fileid). If you want to create your own multifile wordlist corpus, this is a great example to follow.
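
For instance, suppose you have two hypothetical files in the current directory, english containing the words the and a, and spanish containing el and la, one word per line. A sketch of reading them as a single multifile corpus:

>>> from nltk.corpus.reader import WordListCorpusReader
>>> reader = WordListCorpusReader('.', ['english', 'spanish'])
>>> reader.fileids()
['english', 'spanish']
>>> reader.words('spanish')
['el', 'la']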

Names wordlist corpus

Another wordlist corpus that comes with NLTK is the names corpus. It contains two files, female.txt and male.txt, each with a list of a few thousand common first names organized by gender:

>>> from nltk.corpus import names
>>> names.fileids()
['female.txt', 'male.txt']
>>> len(names.words('female.txt'))
5001
>>> len(names.words('male.txt'))
2943
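
Because words() returns a list, a simple membership check tells you whether a given first name appears in a file, as in this short sketch (converting to a set first speeds up repeated lookups):

>>> 'John' in set(names.words('male.txt'))
True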

English words corpus

NLTK also comes with a large list of English words: one file with 850 basic words, and another with over 200,000 known English words, as shown in the following code:

>>> from nltk.corpus import words
>>> words.fileids()
['en', 'en-basic']
>>> len(words.words('en-basic'))
850
>>> len(words.words('en'))
234936
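
One common use for the en list is checking whether a token is a known English word, for example when filtering out misspellings. A minimal sketch (the set() conversion makes repeated lookups fast; the example tokens are arbitrary):

>>> english_words = set(words.words('en'))
>>> 'language' in english_words
True
>>> 'qzxwv' in english_words
False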

See also

The Filtering stopwords in a tokenized sentence recipe in Chapter 1, Tokenizing Text and WordNet Basics, has more details on using the stopwords corpus. In the following recipes, we'll cover more advanced corpus file formats and corpus reader classes.
