Natural Language Processing (NLP) is concerned with the interaction between computers and natural language. It is one of the major components of Artificial Intelligence (AI) and computational linguistics. It provides seamless interaction between computers and human beings, giving computers the ability to understand human language with the help of machine learning. The fundamental data type used to represent the contents of a file or a document in programming languages (for example, C, C++, Java, Python, and so on) is the string. In this chapter, we will explore various operations that can be performed on strings and that are useful for accomplishing various NLP tasks.
This chapter will include the following topics:
Tokenization may be defined as the process of splitting the text into smaller parts called tokens, and is considered a crucial step in NLP.
When NLTK is installed and Python IDLE is running, we can perform the tokenization of text or paragraphs into individual sentences. To perform tokenization, we can import the sentence tokenization function; the argument of this function is the text that needs to be tokenized. The sent_tokenize function uses an instance of PunktSentenceTokenizer from NLTK. This instance has already been trained to perform tokenization on different European languages on the basis of letters or punctuation that mark the beginning and end of sentences.
Now, let's see how a given text is tokenized into individual sentences:
>>> import nltk
>>> text=" Welcome readers. I hope you find it interesting. Please do reply."
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(text)
[' Welcome readers.', 'I hope you find it interesting.', 'Please do reply.']
So, a given text is split into individual sentences. Further, we can perform processing on the individual sentences.
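The idea of splitting the text into sentences first and then processing each sentence can be sketched without NLTK, using plain regular expressions. This is only a rough stand-in for the trained Punkt tokenizer; the function names below are illustrative:

```python
import re

# Rough stand-in for sent_tokenize: split after ., ! or ? followed by whitespace.
def naive_sent_tokenize(text):
    return re.split(r'(?<=[.!?])\s+', text.strip())

# Rough stand-in for per-sentence processing: words or single punctuation marks.
def naive_word_tokenize(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Welcome readers. I hope you find it interesting. Please do reply."
for sentence in naive_sent_tokenize(text):
    print(naive_word_tokenize(sentence))
```

Unlike PunktSentenceTokenizer, this sketch cannot handle abbreviations such as "Nov." correctly, which is precisely why the trained tokenizer is preferred.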
To tokenize a large number of sentences, we can load PunktSentenceTokenizer and use its tokenize() function to perform tokenization. This can be seen in the following code:
>>> import nltk
>>> tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
>>> text=" Hello everyone. Hope all are fine and doing well. Hope you find the book interesting"
>>> tokenizer.tokenize(text)
[' Hello everyone.', 'Hope all are fine and doing well.', 'Hope you find the book interesting']
To perform tokenization in languages other than English, we can load the respective language pickle file found in tokenizers/punkt and then tokenize the text, which is passed as an argument to the tokenize() function. For the tokenization of French text, we will use the french.pickle file as follows:
>>> import nltk
>>> french_tokenizer=nltk.data.load('tokenizers/punkt/french.pickle')
>>> french_tokenizer.tokenize("Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage collège franco-britannique de Levallois-Perret. Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage Levallois. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, janvier, d'un professeur d'histoire. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, mercredi, d'un professeur d'histoire")
["Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage collège franco-britannique de Levallois-Perret.", "Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage Levallois.", "L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, janvier, d'un professeur d'histoire.", "L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, mercredi, d'un professeur d'histoire"]
Now, we'll perform processing on individual sentences. Individual sentences are tokenized into words. Word tokenization is performed using the word_tokenize() function, which in turn uses an instance of TreebankWordTokenizer to perform word tokenization. The tokenization of English text using word_tokenize is shown here:
>>> import nltk
>>> text=nltk.word_tokenize("Pierre Vinken , 59 years old , will join as a nonexecutive director on Nov. 29 .")
>>> print(text)
['Pierre', 'Vinken', ',', '59', 'years', 'old', ',', 'will', 'join', 'as', 'a', 'nonexecutive', 'director', 'on', 'Nov.', '29', '.']
Tokenization of words can also be done by loading TreebankWordTokenizer and then calling the tokenize() function, whose argument is a sentence that needs to be tokenized into words. This instance of NLTK has already been trained to tokenize a sentence into words on the basis of spaces and punctuation.
The following code will help us obtain user input, tokenize it, and evaluate its length:
>>> import nltk
>>> from nltk import word_tokenize
>>> r=input("Please write a text")
Please write a textToday is a pleasant day
>>> print("The length of text is",len(word_tokenize(r)),"words")
The length of text is 5 words
Let's have a look at the code that performs tokenization using TreebankWordTokenizer:
>>> import nltk
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize("Have a nice day. I hope you find the book interesting")
['Have', 'a', 'nice', 'day.', 'I', 'hope', 'you', 'find', 'the', 'book', 'interesting']
TreebankWordTokenizer uses the conventions of the Penn Treebank corpus. It works by separating contractions, as shown here:
>>> import nltk
>>> text=nltk.word_tokenize(" Don't hesitate to ask questions")
>>> print(text)
['Do', "n't", 'hesitate', 'to', 'ask', 'questions']
Another word tokenizer is PunktWordTokenizer. It works by splitting on punctuation, but keeps the punctuation with the word instead of creating an entirely new token. Another word tokenizer is WordPunctTokenizer. It splits punctuation into entirely new tokens. This type of splitting is usually desirable:
>>> from nltk.tokenize import WordPunctTokenizer
>>> tokenizer=WordPunctTokenizer()
>>> tokenizer.tokenize(" Don't hesitate to ask questions")
['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']
The inheritance tree for tokenizers is given here:
The tokenization of words can be performed by constructing regular expressions in these two ways:
We can import RegexpTokenizer from NLTK and create a regular expression that matches the tokens present in the text:
>>> import nltk
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer=RegexpTokenizer("[\w']+")
>>> tokenizer.tokenize("Don't hesitate to ask questions")
["Don't", 'hesitate', 'to', 'ask', 'questions']
Instead of instantiating the class, an alternative way to tokenize is to use this function:
>>> import nltk
>>> from nltk.tokenize import regexp_tokenize
>>> sent="Don't hesitate to ask questions"
>>> print(regexp_tokenize(sent, pattern='\w+|\$[\d\.]+|\S+'))
['Don', "'t", 'hesitate', 'to', 'ask', 'questions']
RegexpTokenizer uses the re.findall() function to perform tokenization by matching tokens, and the re.split() function to perform tokenization by matching gaps or spaces.
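This duality can be seen directly with Python's standard re module: re.findall() matches the tokens themselves, while re.split() treats the pattern as the gaps between tokens. Both calls below produce the same list:

```python
import re

sent = "Don't hesitate to ask questions"

# Match the tokens themselves: runs of word characters plus apostrophes.
print(re.findall(r"[\w']+", sent))

# Match the gaps: split on runs of whitespace between tokens.
print(re.split(r"\s+", sent))
```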
Let's have a look at an example of how to tokenize using whitespaces:
>>> import nltk
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer=RegexpTokenizer('\s+',gaps=True)
>>> tokenizer.tokenize("Don't hesitate to ask questions")
["Don't", 'hesitate', 'to', 'ask', 'questions']
To select the words starting with a capital letter, the following code is used:
>>> import nltk
>>> from nltk.tokenize import RegexpTokenizer
>>> sent=" She secured 90.56 % in class X . She is a meritorious student"
>>> capt = RegexpTokenizer('[A-Z]\w+')
>>> capt.tokenize(sent)
['She', 'She']
The following code shows how a predefined regular expression is used by a subclass of RegexpTokenizer:
>>> import nltk
>>> sent=" She secured 90.56 % in class X . She is a meritorious student"
>>> from nltk.tokenize import BlanklineTokenizer
>>> BlanklineTokenizer().tokenize(sent)
[' She secured 90.56 % in class X . She is a meritorious student']
The tokenization of strings can be done using whitespace (tabs, spaces, or newlines):
>>> import nltk
>>> sent=" She secured 90.56 % in class X . She is a meritorious student"
>>> from nltk.tokenize import WhitespaceTokenizer
>>> WhitespaceTokenizer().tokenize(sent)
['She', 'secured', '90.56', '%', 'in', 'class', 'X', '.', 'She', 'is', 'a', 'meritorious', 'student']
WordPunctTokenizer makes use of the regular expression \w+|[^\w\s]+ to perform the tokenization of text into alphabetic and non-alphabetic characters.
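The same split that WordPunctTokenizer produces can be reproduced with Python's standard re module and the pattern above, which is a quick way to see why the apostrophe in a contraction becomes a token of its own:

```python
import re

sent = "Don't hesitate to ask questions"

# \w+        -> runs of word characters ('Don', 't', 'hesitate', ...)
# [^\w\s]+   -> runs of characters that are neither word characters nor
#               whitespace (the apostrophe), emitted as separate tokens.
print(re.findall(r"\w+|[^\w\s]+", sent))
```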
Tokenization using the split() method is depicted in the following code:
>>> import nltk
>>> sent=" She secured 90.56 % in class X . She is a meritorious student"
>>> sent.split()
['She', 'secured', '90.56', '%', 'in', 'class', 'X', '.', 'She', 'is', 'a', 'meritorious', 'student']
>>> sent.split(' ')
['', 'She', 'secured', '90.56', '%', 'in', 'class', 'X', '.', 'She', 'is', 'a', 'meritorious', 'student']
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> sent.split('\n')
[' She secured 90.56 % in class X ', '. She is a meritorious student', '']
Similar to sent.split('\n'), LineTokenizer works by tokenizing text into lines:
>>> import nltk
>>> from nltk.tokenize import BlanklineTokenizer
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> BlanklineTokenizer().tokenize(sent)
[' She secured 90.56 % in class X \n. She is a meritorious student\n']
>>> from nltk.tokenize import LineTokenizer
>>> LineTokenizer(blanklines='keep').tokenize(sent)
[' She secured 90.56 % in class X ', '. She is a meritorious student']
>>> LineTokenizer(blanklines='discard').tokenize(sent)
[' She secured 90.56 % in class X ', '. She is a meritorious student']
SpaceTokenizer works similar to sent.split(' '):
>>> import nltk
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> from nltk.tokenize import SpaceTokenizer
>>> SpaceTokenizer().tokenize(sent)
['', 'She', 'secured', '90.56', '%', 'in', 'class', 'X', '\n.', 'She', 'is', 'a', 'meritorious', 'student\n']
A tokenizer's span_tokenize() method works by returning the sequence of tuples that are the offsets of the tokens in a sentence:
>>> import nltk
>>> from nltk.tokenize import WhitespaceTokenizer
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> list(WhitespaceTokenizer().span_tokenize(sent))
[(1, 4), (5, 12), (13, 18), (19, 20), (21, 23), (24, 29), (30, 31), (33, 34), (35, 38), (39, 41), (42, 43), (44, 55), (56, 63)]
Given a sequence of spans, the sequence of relative spans can be returned:
>>> import nltk
>>> from nltk.tokenize import WhitespaceTokenizer
>>> from nltk.tokenize.util import spans_to_relative
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> list(spans_to_relative(WhitespaceTokenizer().span_tokenize(sent)))
[(1, 3), (1, 7), (1, 5), (1, 1), (1, 2), (1, 5), (1, 1), (2, 1), (1, 3), (1, 2), (1, 1), (1, 11), (1, 7)]
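What spans_to_relative() computes can be sketched in a few lines of plain Python: each absolute (start, end) pair becomes (distance from the previous token's end, token length). This is a simplified sketch, not NLTK's exact implementation:

```python
def to_relative(spans):
    """Convert absolute (start, end) spans to (gap, length) pairs."""
    prev_end = 0
    for start, end in spans:
        yield (start - prev_end, end - start)  # gap since last token, token length
        prev_end = end

# First three whitespace spans from the example above.
spans = [(1, 4), (5, 12), (13, 18)]
print(list(to_relative(spans)))  # → [(1, 3), (1, 7), (1, 5)]
```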
nltk.tokenize.util.string_span_tokenize(sent, separator) will return the offsets of the tokens in sent by splitting at each occurrence of the separator:
>>> import nltk
>>> from nltk.tokenize.util import string_span_tokenize
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> list(string_span_tokenize(sent, " "))
[(1, 4), (5, 12), (13, 18), (19, 20), (21, 23), (24, 29), (30, 31), (32, 34), (35, 38), (39, 41), (42, 43), (44, 55), (56, 64)]
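The offset computation behind string_span_tokenize() can be approximated in plain Python by scanning for the separator with str.find(). This is a simplified sketch, not NLTK's exact implementation:

```python
def span_tokenize_by_sep(s, sep):
    """Yield (start, end) offsets of the chunks between occurrences of sep."""
    left = 0
    while True:
        index = s.find(sep, left)
        if index == -1:
            break
        if index > left:          # skip empty chunks from adjacent separators
            yield (left, index)
        left = index + len(sep)
    if left < len(s):             # trailing chunk after the last separator
        yield (left, len(s))

print(list(span_tokenize_by_sep("to be or not", " ")))  # → [(0, 2), (3, 5), (6, 8), (9, 12)]
```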