Chapter 1. Working with Strings

Natural Language Processing (NLP) is concerned with the interaction between natural language and computers. It is one of the major components of Artificial Intelligence (AI) and computational linguistics. It enables seamless interaction between computers and human beings and gives computers the ability to understand human language with the help of machine learning. The fundamental data type used to represent the contents of a file or a document in programming languages (for example, C, C++, Java, Python, and so on) is the string. In this chapter, we will explore various operations that can be performed on strings and that will be useful for accomplishing various NLP tasks.

This chapter will include the following topics:

  • Tokenization of text
  • Normalization of text
  • Substituting and correcting tokens
  • Applying Zipf's law to text
  • Applying similarity measures using the Edit Distance Algorithm
  • Applying similarity measures using Jaccard's Coefficient
  • Applying similarity measures using Smith Waterman

Tokenization

Tokenization may be defined as the process of splitting the text into smaller parts called tokens, and is considered a crucial step in NLP.

When NLTK is installed and Python IDLE is running, we can tokenize text or paragraphs into individual sentences. To perform tokenization, we can import the sentence tokenization function, sent_tokenize. The argument of this function is the text that needs to be tokenized. The sent_tokenize function uses an instance of PunktSentenceTokenizer internally. This tokenizer has already been trained on several European languages, so it knows which letters and punctuation mark the beginning and end of sentences.
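If the Punkt sentence tokenizer models are not yet present on the system, they can be fetched once with the NLTK downloader (a one-time step, assuming an internet connection is available):

>>> import nltk
>>> nltk.download('punkt')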

Tokenization of text into sentences

Now, let's see how a given text is tokenized into individual sentences:

>>> import nltk
>>> text=" Welcome readers. I hope you find it interesting. Please do reply."
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(text)
[' Welcome readers.', 'I hope you find it interesting.', 'Please do reply.']

So, the given text is split into individual sentences, and we can then process each sentence on its own.
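For example, once the text has been split, each sentence can be handled separately (a minimal sketch that simply loops over the result of sent_tokenize and prints each sentence with its length):

>>> from nltk.tokenize import sent_tokenize
>>> for sentence in sent_tokenize("Welcome readers. I hope you find it interesting. Please do reply."):
...     print(len(sentence), sentence)
...
16 Welcome readers.
31 I hope you find it interesting.
16 Please do reply.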

To tokenize a large number of sentences, we can load PunktSentenceTokenizer and use the tokenize() function to perform tokenization. This can be seen in the following code:

>>> import nltk
>>> tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
>>> text=" Hello everyone. Hope all are fine and doing well. Hope you find the book interesting"
>>> tokenizer.tokenize(text)
[' Hello everyone.', 'Hope all are fine and doing well.', 'Hope you find the book interesting']

Tokenization of text in other languages

To perform tokenization in languages other than English, we can load the respective language pickle file found in tokenizers/punkt and then tokenize the text, which is passed as the argument of the tokenize() function. For the tokenization of French text, we will use the french.pickle file as follows:

>>> import nltk
>>> french_tokenizer=nltk.data.load('tokenizers/punkt/french.pickle')
>>> french_tokenizer.tokenize("Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage au collège franco-britannique de Levallois-Perret. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, en janvier, d'un professeur d'histoire.")
['Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage au collège franco-britannique de Levallois-Perret.', "L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, en janvier, d'un professeur d'histoire."]
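Pickle files for several other languages ship in the same tokenizers/punkt directory. For instance, a Spanish tokenizer can be loaded in exactly the same way (a quick sketch, assuming the spanish.pickle model was installed along with the rest of the punkt data):

>>> spanish_tokenizer=nltk.data.load('tokenizers/punkt/spanish.pickle')
>>> spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')
['Hola amigo.', 'Estoy bien.']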

Tokenization of sentences into words

Now, we'll perform processing on individual sentences. Individual sentences are tokenized into words using the word_tokenize() function. The word_tokenize() function uses an instance of TreebankWordTokenizer internally to perform word tokenization.

The tokenization of English text using word_tokenize is shown here:

>>> import nltk
>>> text=nltk.word_tokenize("Pierre Vinken , 59 years old , will join as a nonexecutive director on Nov. 29 .")
>>> print(text)
['Pierre', 'Vinken', ',', '59', 'years', 'old', ',', 'will', 'join', 'as', 'a', 'nonexecutive', 'director', 'on', 'Nov.', '29', '.']

Tokenization of words can also be done by loading TreebankWordTokenizer and then calling its tokenize() function, whose argument is a sentence that needs to be tokenized into words. This tokenizer splits sentences into words on the basis of spaces and punctuation, following the Penn Treebank conventions.

The following code will help us obtain user input, tokenize it, and evaluate its length:

>>> import nltk
>>> from nltk import word_tokenize
>>> r=input("Please write a text")
Please write a textToday is a pleasant day
>>> print("The length of text is",len(word_tokenize(r)),"words")
The length of text is 5 words

Tokenization using TreebankWordTokenizer

Let's have a look at the code that performs tokenization using TreebankWordTokenizer:

>>> import nltk
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize("Have a nice day. I hope you find the book interesting")
['Have', 'a', 'nice', 'day.', 'I', 'hope', 'you', 'find', 'the', 'book', 'interesting']

TreebankWordTokenizer uses the conventions of the Penn Treebank corpus. Among other things, it works by separating contractions into two tokens. This is shown here:

>>> import nltk
>>> text=nltk.word_tokenize(" Don't hesitate to ask questions")
>>> print(text)
['Do', "n't", 'hesitate', 'to', 'ask', 'questions']

Another word tokenizer is PunktWordTokenizer. It splits on punctuation, but keeps the punctuation attached to the word instead of creating an entirely new token. Yet another word tokenizer is WordPunctTokenizer, which splits all punctuation into entirely separate tokens. This type of splitting is usually desirable:

>>> from nltk.tokenize import WordPunctTokenizer
>>> tokenizer=WordPunctTokenizer()
>>> tokenizer.tokenize(" Don't hesitate to ask questions")
['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']
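For contrast, the PunktWordTokenizer mentioned above was only available in older NLTK releases (it has since been removed from the library); roughly, it kept the apostrophe attached to the following characters rather than turning it into its own token (a sketch based on the old API, shown here only for comparison):

>>> from nltk.tokenize import PunktWordTokenizer   # only present in older NLTK versions
>>> PunktWordTokenizer().tokenize(" Don't hesitate to ask questions")
['Don', "'t", 'hesitate', 'to', 'ask', 'questions']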

The inheritance tree for NLTK's tokenizer classes is given in the following figure:

[Figure: inheritance tree of the NLTK tokenizer classes]

Tokenization using regular expressions

The tokenization of words can be performed by constructing regular expressions in these two ways:

  • By matching the words themselves
  • By matching the spaces or gaps between the words

We can import RegexpTokenizer from NLTK and create a regular expression that matches the tokens present in the text:

>>> import nltk
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer=RegexpTokenizer("[\w']+")
>>> tokenizer.tokenize("Don't hesitate to ask questions")
["Don't", 'hesitate', 'to', 'ask', 'questions']

Instead of instantiating the class, an alternative way to tokenize is to use the regexp_tokenize() function:

>>> import nltk
>>> from nltk.tokenize import regexp_tokenize
>>> sent="Don't hesitate to ask questions"
>>> print(regexp_tokenize(sent, pattern='\w+|\$[\d\.]+|\S+'))
['Don', "'t", 'hesitate', 'to', 'ask', 'questions']

RegexpTokenizer uses the re.findall() function to perform tokenization by matching tokens. It uses the re.split() function to perform tokenization by matching gaps or spaces.
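To illustrate this relationship, here is a rough equivalent using Python's re module directly (just a sketch; the tokenizer class wraps such calls and adds conveniences such as pattern compilation):

>>> import re
>>> re.findall(r"[\w']+", "Don't hesitate to ask questions")
["Don't", 'hesitate', 'to', 'ask', 'questions']
>>> re.split(r'\s+', "Don't hesitate to ask questions")
["Don't", 'hesitate', 'to', 'ask', 'questions']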

Let's have a look at an example of how to tokenize using whitespaces:

>>> import nltk
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer=RegexpTokenizer('\s+',gaps=True)
>>> tokenizer.tokenize("Don't hesitate to ask questions")
["Don't", 'hesitate', 'to', 'ask', 'questions']

To select the words starting with a capital letter, the following code is used:

>>> import nltk
>>> from nltk.tokenize import RegexpTokenizer
>>> sent=" She secured 90.56 % in class X . She is a meritorious student"
>>> capt = RegexpTokenizer('[A-Z]\w+')
>>> capt.tokenize(sent)
['She', 'She'] 

The following code shows how a predefined Regular Expression is used by a subclass of RegexpTokenizer:

>>> import nltk
>>> sent=" She secured 90.56 % in class X . She is a meritorious student"
>>> from nltk.tokenize import BlanklineTokenizer
>>> BlanklineTokenizer().tokenize(sent)
[' She secured 90.56 % in class X \n. She is a meritorious student\n']

The tokenization of strings can be done using whitespace—tab, space, or newline:

>>> import nltk
>>> sent=" She secured 90.56 % in class X . She is a meritorious student"
>>> from nltk.tokenize import WhitespaceTokenizer
>>> WhitespaceTokenizer().tokenize(sent)
['She', 'secured', '90.56', '%', 'in', 'class', 'X', '.', 'She', 'is', 'a', 'meritorious', 'student']

WordPunctTokenizer makes use of the regular expression \w+|[^\w\s]+ to perform the tokenization of text into alphabetic and non-alphabetic characters.
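For instance, running WordPunctTokenizer on the same sentence used above splits the number 90.56 into alphanumeric and punctuation pieces, unlike WhitespaceTokenizer, which kept it intact (a quick illustrative run):

>>> from nltk.tokenize import WordPunctTokenizer
>>> WordPunctTokenizer().tokenize(" She secured 90.56 % in class X . She is a meritorious student")
['She', 'secured', '90', '.', '56', '%', 'in', 'class', 'X', '.', 'She', 'is', 'a', 'meritorious', 'student']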

Tokenization using the split() method is depicted in the following code:

>>> import nltk
>>> sent=" She secured 90.56 % in class X . She is a meritorious student"
>>> sent.split()
['She', 'secured', '90.56', '%', 'in', 'class', 'X', '.', 'She', 'is', 'a', 'meritorious', 'student']
>>> sent.split(' ')
['', 'She', 'secured', '90.56', '%', 'in', 'class', 'X', '.', 'She', 'is', 'a', 'meritorious', 'student']
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> sent.split('\n')
[' She secured 90.56 % in class X ', '. She is a meritorious student', '']

Similar to sent.split('\n'), LineTokenizer works by tokenizing text into lines:

>>> import nltk
>>> from nltk.tokenize import BlanklineTokenizer
>>> sent=" She secured 90.56 % in class X 
. She is a meritorious student
"
>>> BlanklineTokenizer().tokenize(sent)
[' She secured 90.56 % in class X 
. She is a meritorious student
']
>>> from nltk.tokenize import LineTokenizer
>>> LineTokenizer(blanklines='keep').tokenize(sent)
[' She secured 90.56 % in class X ', '. She is a meritorious student']
>>> LineTokenizer(blanklines='discard').tokenize(sent)
[' She secured 90.56 % in class X ', '. She is a meritorious student']

SpaceTokenizer works similarly to sent.split(' '):

>>> import nltk
>>> sent=" She secured 90.56 % in class X 
. She is a meritorious student
"
>>> from nltk.tokenize import SpaceTokenizer
>>> SpaceTokenizer().tokenize(sent)
['', 'She', 'secured', '90.56', '%', 'in', 'class', 'X', '
.', 'She', 'is', 'a', 'meritorious', 'student
']

A tokenizer's span_tokenize() method returns a sequence of tuples that are the character offsets of the tokens in a sentence:

>>> import nltk
>>> from nltk.tokenize import WhitespaceTokenizer
>>> sent=" She secured 90.56 % in class X 
. She is a meritorious student
"
>>> list(WhitespaceTokenizer().span_tokenize(sent))
[(1, 4), (5, 12), (13, 18), (19, 20), (21, 23), (24, 29), (30, 31), (33, 34), (35, 38), (39, 41), (42, 43), (44, 55), (56, 63)]

Given a sequence of spans, the sequence of relative spans can be returned:

>>> import nltk
>>> from nltk.tokenize import WhitespaceTokenizer
>>> from nltk.tokenize.util import spans_to_relative
>>> sent=" She secured 90.56 % in class X 
. She is a meritorious student
"
>>>list(spans_to_relative(WhitespaceTokenizer().span_tokenize(sent)))
[(1, 3), (1, 7), (1, 5), (1, 1), (1, 2), (1, 5), (1, 1), (2, 1), (1, 3), (1, 2), (1, 1), (1, 11), (1, 7)]

nltk.tokenize.util.string_span_tokenize(sent, separator) will return the offsets of the tokens in sent by splitting it at each occurrence of the separator:

>>> import nltk
>>> from nltk.tokenize.util import string_span_tokenize
>>> sent=" She secured 90.56 % in class X 
. She is a meritorious student
"
>>> list(string_span_tokenize(sent, ""))
[(1, 4), (5, 12), (13, 18), (19, 20), (21, 23), (24, 29), (30, 31), (32, 34), (35, 38), (39, 41), (42, 43), (44, 55), (56, 64)]
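If needed, the offsets can be mapped back to the token strings by slicing the original sentence (a small illustrative step that reuses the spans computed above):

>>> spans = list(string_span_tokenize(sent, " "))
>>> [sent[start:end] for start, end in spans]
['She', 'secured', '90.56', '%', 'in', 'class', 'X', '\n.', 'She', 'is', 'a', 'meritorious', 'student\n']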