When you are given any text, the first task is to tokenize it into a format that suits the problem requirements. Tokenization is a very broad term; text can be tokenized at various levels of granularity, such as the paragraph, sentence, and word levels.
In this section, we will see sentence-level and word-level tokenization. The methods are similar and can easily be applied at the paragraph level or any other level of granularity required by the problem at hand. We will see how to perform both sentence-level and word-level tokenization in a single recipe.
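As a rough, library-free illustration of these granularity levels (plain Python and the re module here, not the NLTK tokenizers used in this recipe, and a naive regex that is far less robust than NLTK's Punkt tokenizer), consider:

```python
import re

text = "Peter Piper picked a peck. A peck Peter picked!\n\nWheres the peck?"

# Paragraph level: split on blank lines.
paragraphs = [p for p in text.split("\n\n") if p.strip()]

# Sentence level: a naive split after sentence-final punctuation.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Word level: words and individual punctuation marks as separate tokens.
words = re.findall(r"\w+|[^\w\s]", text)

print(len(paragraphs), len(sentences), len(words))  # 2 3 15
```

The same text yields different token counts at each level, which is why the problem at hand dictates the granularity you choose.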
Let's start with the demonstration of sentence tokenization:
# Load Libraries
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from collections import defaultdict

# 1. Let us use a very simple text to demonstrate tokenization
# at the sentence level and word level. You have seen this example in the
# dictionary recipe, except for some punctuation that has been added.
sentence = "Peter Piper picked a peck of pickled peppers. A peck of pickled peppers, Peter Piper picked !!! If Peter Piper picked a peck of pickled peppers, Wheres the peck of pickled peppers Peter Piper picked ?"

# 2. Using the nltk sentence tokenizer, we tokenize the given text into
# sentences and verify the output using some print statements.
sent_list = sent_tokenize(sentence)

print "No sentences = %d"%(len(sent_list))
print "Sentences"
for sent in sent_list:
    print sent

# 3. With the sentences extracted, let us proceed to extract
# words from these sentences.
word_dict = defaultdict(list)
for i,sent in enumerate(sent_list):
    word_dict[i].extend(word_tokenize(sent))

print word_dict
Let's take a quick peek at how NLTK performs its sentence tokenization:
def sent_tokenize(text, language='english'):
    """
    Return a sentence-tokenized copy of *text*,
    using NLTK's recommended sentence tokenizer
    (currently :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus
    """
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
    return tokenizer.tokenize(text)
In step 1, we initialize a variable, sentence, with a paragraph of text. This is the same example that we used in the dictionary recipe. In step 2, we use nltk's sent_tokenize function to extract sentences from the given text.
You can look at the source of sent_tokenize in the NLTK documentation at http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize.
As you can see, sent_tokenize loads a prebuilt tokenizer model and uses it to tokenize the given text, returning the output. The tokenizer model is an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. Pretrained instances of this tokenizer are available for several languages; in our case, the language parameter defaults to 'english'.
Let's look at the output of this step:
No sentences = 3
Sentences
Peter Piper picked a peck of pickled peppers.
A peck of pickled peppers, Peter Piper picked !!!
If Peter Piper picked a peck of pickled peppers, Wheres the peck of pickled peppers Peter Piper picked ?
As you can see, the sentence tokenizer has split our input text into three sentences. Let's proceed to step 3, where we tokenize these sentences into words. Here, we use the word_tokenize function to extract the words from each sentence and store them in a dictionary, where the key is the sentence number and the value is the list of words for that sentence. Let's look at the output of the print statement:
defaultdict(<type 'list'>,
    {0: ['Peter', 'Piper', 'picked', 'a', 'peck', 'of', 'pickled', 'peppers', '.'],
     1: ['A', 'peck', 'of', 'pickled', 'peppers', ',', 'Peter', 'Piper', 'picked', '!', '!', '!'],
     2: ['If', 'Peter', 'Piper', 'picked', 'a', 'peck', 'of', 'pickled', 'peppers', ',', 'Wheres', 'the', 'peck', 'of', 'pickled', 'peppers', 'Peter', 'Piper', 'picked', '?']})
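The defaultdict(list) pattern in step 3 is what lets us append words without first checking whether a sentence key exists. A minimal standalone sketch of the idea, with a simple whitespace split standing in for word_tokenize:

```python
from collections import defaultdict

sentences = ["Peter Piper picked", "a peck of pickled peppers"]

word_dict = defaultdict(list)
for i, sent in enumerate(sentences):
    # extend() works even on the first access, because a missing key
    # is transparently initialized to an empty list
    word_dict[i].extend(sent.split())

print(word_dict[1])  # ['a', 'peck', 'of', 'pickled', 'peppers']
```

With a plain dict, the first word_dict[i] access for each new sentence number would raise a KeyError.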
The word_tokenize function uses a regular expression to split the sentences into words. It is useful to look at the source of word_tokenize, found at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize.
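As a rough sketch of that idea (a single simplified pattern; the actual PunktLanguageVars regex is considerably more involved), words and individual punctuation marks can be pulled out like this:

```python
import re

def naive_word_tokenize(text):
    # \w+ grabs runs of word characters; [^\w\s] grabs each punctuation
    # mark separately, so "!!!" becomes three '!' tokens
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_word_tokenize("A peck of pickled peppers, Peter Piper picked !!!"))
# ['A', 'peck', 'of', 'pickled', 'peppers', ',', 'Peter', 'Piper', 'picked', '!', '!', '!']
```

Note how the comma and each exclamation mark become tokens of their own, matching the shape of the dictionary output earlier.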
For sentence tokenization, we saw one way of doing it in NLTK; there are other methods available. The nltk.tokenize.simple module has a line_tokenize method. Let's take the same input sentence as before and run it through line_tokenize:
# Load Libraries
from nltk.tokenize import line_tokenize

sentence = "Peter Piper picked a peck of pickled peppers. A peck of pickled peppers, Peter Piper picked !!! If Peter Piper picked a peck of pickled peppers, Wheres the peck of pickled peppers Peter Piper picked ?"

sent_list = line_tokenize(sentence)
print "No sentences = %d"%(len(sent_list))
print "Sentences"
for sent in sent_list:
    print sent

# Include new line characters
sentence = "Peter Piper picked a peck of pickled peppers.\nA peck of pickled peppers, Peter Piper picked !!!\nIf Peter Piper picked a peck of pickled peppers, Wheres the peck of pickled peppers Peter Piper picked ?"

sent_list = line_tokenize(sentence)
print "No sentences = %d"%(len(sent_list))
print "Sentences"
for sent in sent_list:
    print sent
The output is as follows:
No sentences = 1
Sentences
Peter Piper picked a peck of pickled peppers. A peck of pickled peppers, Peter Piper picked !!! If Peter Piper picked a peck of pickled peppers, Wheres the peck of pickled peppers Peter Piper picked ?
You can see that only one sentence has been retrieved from the input, as there are no newline characters in it.
Let's now modify our input in order to include new line characters:
sentence = "Peter Piper picked a peck of pickled peppers.\nA peck of pickled peppers, Peter Piper picked !!!\nIf Peter Piper picked a peck of pickled peppers, Wheres the peck of pickled peppers Peter Piper picked ?"
Note that we have newline characters (\n) added after the first and second sentences. We again apply line_tokenize to get the following output:
No sentences = 3
Sentences
Peter Piper picked a peck of pickled peppers.
A peck of pickled peppers, Peter Piper picked !!!
If Peter Piper picked a peck of pickled peppers, Wheres the peck of pickled peppers Peter Piper picked ?
You can see that it has tokenized our input at the newlines, and we now have three sentences.
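Under the hood, this behaviour is close to splitting the text on newline characters. A stdlib-only approximation (not NLTK's actual implementation, which uses a regex-based LineTokenizer) would be:

```python
text = ("Peter Piper picked a peck of pickled peppers.\n"
        "A peck of pickled peppers, Peter Piper picked !!!\n"
        "If Peter Piper picked a peck of pickled peppers, "
        "Wheres the peck of pickled peppers Peter Piper picked ?")

# Roughly what line_tokenize does: one token per non-blank line.
lines = [line for line in text.splitlines() if line.strip()]
print("No sentences = %d" % len(lines))  # No sentences = 3
```

This makes it clear why the first run returned a single "sentence": without newlines, there is nothing for a line-based tokenizer to split on.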
See Chapter 3 of the NLTK book; it has more references for sentence and word tokenization. It can be found at http://www.nltk.org/book/ch03.html.