Performing tokenization

Given any text, the first job is to tokenize it into units suited to the problem at hand. Tokenization is a very broad term; we can tokenize text at the following levels of granularity:

  • The paragraph level
  • The sentence level
  • The word level

In this section, we will see sentence-level and word-level tokenization. The methods are similar and can easily be adapted to the paragraph level, or to any other level of granularity the problem at hand requires; a minimal paragraph-level sketch follows.
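NLTK does not ship a dedicated paragraph tokenizer, but if we assume that paragraphs are separated by blank lines (an assumption, not a rule that holds for every corpus), a simple regular expression split does the job:

# A minimal paragraph-level sketch, assuming paragraphs
# are separated by one or more blank lines
import re

text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."

# Split on runs of blank lines and drop empty fragments
paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]

print paragraphs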

Getting ready

We will see how to perform sentence-level and word-level tokenization in a single recipe.
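The recipes in this section rely on NLTK's pretrained Punkt models. If you have never used NLTK's tokenizers before, you may need to download the models once; this is a one-time setup step, assuming an internet connection:

# One-time download of the pretrained Punkt tokenizer models
import nltk
nltk.download('punkt')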

How to do it…

Let's start with a demonstration of sentence tokenization:

# Load Libraries
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from collections import defaultdict


# 1. Let us use a very simple text to demonstrate tokenization
# at the sentence level and word level. You have seen this example in the
# dictionary recipe, except for some punctuation that has been added.

sentence = "Peter Piper picked a peck of pickled peppers. A peck of pickled 
peppers, Peter Piper picked !!! If Peter Piper picked a peck of pickled 
peppers, Wheres the peck of pickled peppers Peter Piper picked ?"

# 2. Using the NLTK sentence tokenizer, we tokenize the given text into
# sentences and verify the output using some print statements.

sent_list = sent_tokenize(sentence)

print "No sentences = %d"%(len(sent_list))
print "Sentences"
for sent in sent_list: print sent

# 3. With the sentences extracted, let us proceed to extract
# words from these sentences.
word_dict = defaultdict(list)
for i,sent in enumerate(sent_list):
    word_dict[i].extend(word_tokenize(sent))

print word_dict

Here is a quick peek at how NLTK performs sentence tokenization under the hood:

def sent_tokenize(text, language='english'):
    """
    Return a sentence-tokenized copy of *text*,
    using NLTK's recommended sentence tokenizer
    (currently :class:`.PunktSentenceTokenizer`
    for the specified language).

    :param text: text to split into sentences
    :param language: the model name in the Punkt corpus
    """
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
    return tokenizer.tokenize(text)

How it works…

In step 1, we initialize the variable sentence with a paragraph of text. This is the same example that we used in the dictionary recipe. In step 2, we use NLTK's sent_tokenize function to extract sentences from the given text. You can look at the source of sent_tokenize in the NLTK documentation at http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize.

As you can see, sent_tokenize loads a prebuilt tokenizer model and uses it to tokenize the given text. The model is an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module, and pretrained instances are available for several languages. In our case, the language parameter is left at its default, english.
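This also makes it easy to tokenize non-English text: load the Punkt model for the language you need. The following is a minimal sketch, assuming the standard punkt download (which includes a German model, among others) is in place:

import nltk.data

# Load the pretrained German Punkt model explicitly
german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')

german_text = "Das ist ein Satz. Das ist noch ein Satz."
for sent in german_tokenizer.tokenize(german_text): print sent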

Let's get back to our recipe and look at the output of step 2:

No sentences = 3
Sentences
Peter Piper picked a peck of pickled peppers.
A peck of pickled peppers, Peter Piper picked !!!
If Peter Piper picked a peck of pickled peppers, Wheres the peck of pickled peppers Peter Piper picked ?

As you can see, the sentence tokenizer has split our input text into three sentences. Let's proceed to step 3, where we tokenize these sentences into words. Here, we use the word_tokenize function to extract the words from each sentence and store them in a dictionary, where the key is the sentence number and the value is the list of words for that sentence. Let's look at the output of the print statement:

defaultdict(<type 'list'>, {0: ['Peter', 'Piper', 'picked', 'a', 'peck', 'of', 'pickled', 'peppers', '.'], 1: ['A', 'peck', 'of', 'pickled', 'peppers', ',', 'Peter', 'Piper', 'picked', '!', '!', '!'], 2: ['If', 'Peter', 'Piper', 'picked', 'a', 'peck', 'of', 'pickled', 'peppers', ',', 'Wheres', 'the', 'peck', 'of', 'pickled', 'peppers', 'Peter', 'Piper', 'picked', '?']})

The word_tokenize function uses regular expressions to split a sentence into words. It is worth looking at the source of word_tokenize at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize.
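If you want direct control over the splitting pattern, NLTK also provides RegexpTokenizer, which tokenizes text with a regular expression you supply. The following is a minimal sketch; the \w+ pattern keeps runs of alphanumeric word characters and drops the punctuation tokens that word_tokenize would have kept:

from nltk.tokenize import RegexpTokenizer

# Keep runs of word characters; punctuation is discarded
tokenizer = RegexpTokenizer(r'\w+')
print tokenizer.tokenize("A peck of pickled peppers, Peter Piper picked !!!")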

There's more…

We have seen one way of performing sentence tokenization in NLTK; there are other methods available. The nltk.tokenize.simple module has a line_tokenize method. Let's take the same input sentence as before and run it through line_tokenize:

# Load Libraries
from nltk.tokenize import line_tokenize


sentence = "Peter Piper picked a peck of pickled peppers. A peck of pickled 
peppers, Peter Piper picked !!! If Peter Piper picked a peck of pickled 
peppers, Wheres the peck of pickled peppers Peter Piper picked ?"


sent_list = line_tokenize(sentence)
print "No sentences = %d"%(len(sent_list))
print "Sentences"
for sent in sent_list: print sent

# Include newline characters
sentence = ("Peter Piper picked a peck of pickled peppers. A peck of pickled\n"
            "peppers, Peter Piper picked !!! If Peter Piper picked a peck of pickled\n"
            "peppers, Wheres the peck of pickled peppers Peter Piper picked ?")

sent_list = line_tokenize(sentence)
print "No sentences = %d"%(len(sent_list))
print "Sentences"
for sent in sent_list: print sent

The output of the first run (without newline characters) is as follows:

No sentences = 1
Sentences
Peter Piper picked a peck of pickled peppers. A peck of pickled peppers, Peter Piper picked !!! If Peter Piper picked a peck of pickled peppers, Wheres the peck of pickled peppers Peter Piper picked ?

You can see that only a single sentence was retrieved from the input, as it contains no newline characters.

Let's now modify our input in order to include new line characters:

sentence = "Peter Piper picked a peck of pickled peppers. A peck of pickled
 
peppers, Peter Piper picked !!! If Peter Piper picked a peck of pickled
 
peppers, Wheres the peck of pickled peppers Peter Piper picked ?"

Note that newline characters have now been added. Applying line_tokenize again gives the following output:

No sentences = 3
Sentences
Peter Piper picked a peck of pickled peppers. A peck of pickled
peppers, Peter Piper picked !!! If Peter Piper picked a peck of pickled
peppers, Wheres the peck of pickled peppers Peter Piper picked ?

You can see that line_tokenize has split our input at the newline characters, and we now have three sentences.
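line_tokenize also accepts a blanklines argument that controls how blank lines are treated; the recognized values are 'discard' (the default), 'keep', and 'discard-eof'. Here is a small sketch of the difference between the first two:

from nltk.tokenize import line_tokenize

text = "First line.\n\nSecond line.\n"

# Default behaviour: blank lines are dropped entirely
print line_tokenize(text)

# 'keep' retains blank lines as empty tokens
print line_tokenize(text, blanklines='keep')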

See Chapter 3 of the NLTK book for more on sentence and word tokenization: http://www.nltk.org/book/ch03.html.

See also

  • Using Dictionary object recipe in Chapter 1, Using Python for Data Science
  • Writing list recipe in Chapter 1, Using Python for Data Science