The human brain is one of the most advanced machines when it comes to processing, understanding, and generating (P-U-G) natural language. The capabilities of the human brain stretch far beyond just being able to perform P-U-G on one language, dialect, accent, and conversational undertone. No machine has so far reached the human potential of performing all three tasks seamlessly. However, the advances in machine learning algorithms and computing power are making the distant dream of creating human-like bots a possibility.
NLP, NLU, and NLG
Type | NLP | NLU | NLG |
---|---|---|---|
Brief | Process and analyze written or spoken text by breaking it down, comprehending its meaning, and determining the appropriate action. It involves parsing, sentence breaking, and stemming. | A specific type of NLP that helps to deal with reading comprehension, which includes the ability to understand meaning from its discourse content and identify the main thought of a passage. | NLG is one of the tasks of NLP to generate natural language text from structured data from a knowledge base. In other words, it transforms data into a written narrative. |
Functions | Identify part of speech, text categorizing, named entity recognition, translation, speech recognition | Automatic summarization, semantic parsing, question answering, sentiment analysis | Content determination, document structuring, generating text in interactive conversation |
Real-World Application | Article classification for digital news aggregation company | Building a Q&A chatbot, brand sentiment using Twitter and Facebook data | Generating a product description for an e-commerce website or a financial portfolio summary |
Chatbot Architecture
1. Customer says, “Help me book a flight for tomorrow from London to New York” through the airline’s Facebook page. In this case, Facebook becomes the presentation layer. A fully functional chatbot could be integrated into a company’s website, social network page, and messaging apps like Skype and Slack.
2. Next, the message is carried to the messaging backend, where the plain text passes through an NLP/NLU engine that breaks the text into tokens and converts the message into a machine-understandable command. We will revisit this in greater detail throughout this chapter.
3. The decision engine then matches the command with preconfigured workflows. For example, to book a flight, the system needs a source and a destination. This is where NLG helps: the chatbot asks, “Sure, I will help you book your flight from London to New York. Could you please let me know if you prefer your flight from Heathrow or Gatwick Airport?” The chatbot picks up the source and destination and automatically generates a follow-up question asking which airport the customer prefers.
4. The chatbot now hits the data layer and fetches the flight information from pre-fed data sources, which are typically connected to live booking systems. The data source provides flight availability, price, and many other services as per the design.
Some chatbots rely heavily on generative responses, while others are built for retrieving information and fitting it into a predesigned conversational flow. For example, in the flight booking use case, we know almost all the possible ways a customer could ask to book a flight, whereas for a telemedicine company’s chatbot, we cannot anticipate all the possible questions a patient could ask. So the telemedicine chatbot needs the help of generative models built using NLG techniques, whereas for the flight booking chatbot, a good retrieval-based system with an NLP/NLU engine should work.
Since this book is about building an enterprise chatbot, we will focus more on the applications of P-U-G in natural languages rather than going deep into the foundations of the subject. In the next section, we’ll show various techniques for NLP and NLU using some of the most popular tools in Python. There are other Java- and C#-based libraries; however, Python libraries enjoy more significant community support and enable faster development.
Popular Open Source NLP and NLU Tools
In this section, we will briefly explore various open source tools available to perform natural language processing, understanding, and generation. While these tools do not draw strict boundaries between the P-U-G of natural language, we will demonstrate their capabilities under the three corresponding headings.
NLTK
Classification of text: Classifying text into different categories for better organization and content filtering
Tokenization of sentences: Breaking sentences into words for symbolic and statistical natural language processing
Stemming words: Reducing words into base or root form
Part-of-speech (POS) tagging: Tagging the words into POS, which categorizes the words into similar grammatical properties
Parsing text: Determining the syntactic structure of text based on the underlying grammar
Semantic reasoning: Ability to understand the meaning of the word to create representations
NLTK is the first-choice tool for teaching NLP. It is also widely used as a platform for prototyping and research.
spaCy
Most organizations that build products involving natural language data are adopting spaCy. It stands out by offering a production-grade NLP engine that is accurate and fast, and its extensive documentation further increases the adoption rate. It is developed in Python and Cython. All the language models in spaCy are trained using deep learning, which provides high accuracy for NLP tasks.
Covers NLTK features: Provides all the features of NLTK, such as tokenization, POS tagging, dependency trees, named entity recognition, and many more.
Deep learning workflow: spaCy supports deep learning workflows that can connect to models trained on popular frameworks like TensorFlow, Keras, scikit-learn, and PyTorch. This makes spaCy one of the most potent libraries for building and deploying sophisticated language models for real-world applications.
Multi-language support: Provides support for more than 50 languages including French, Spanish, and Greek.
Processing pipeline: Offers an easy-to-use and very intuitive processing pipeline for performing a series of NLP tasks in an organized manner. For example, a pipeline for performing POS tagging, parsing the sentence, and named entity extraction could be defined in a list like this: pipeline = ["tagger", "parser", "ner"]. This makes the code easy to read and quick to debug.
Visualizers: Using displaCy, it becomes easy to draw a dependency tree and entity recognizer. We can add our colors to make the visualization aesthetically pleasing and beautiful. It quickly renders in a Jupyter notebook as well.
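The pipeline concept above can be sketched without downloading any trained model by starting from a blank English pipeline (assumes spaCy 3.x; the "sentencizer" component used here is a rule-based stand-in for the trained components named in the text):

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components yet.
nlp = spacy.blank("en")

# Add a rule-based sentence segmenter as one pipeline component.
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)  # ['sentencizer']

doc = nlp("spaCy pipelines are ordered. Each component processes the Doc.")
print([sent.text for sent in doc.sents])
```

With a trained model such as en_core_web_sm, `nlp.pipe_names` would instead list components like "tagger", "parser", and "ner".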
CoreNLP
Fast and robust: Since it is written in Java, which is a time-tested and robust programming language, CoreNLP is a favorite for many developers.
A broad range of grammatical analysis: Like NLTK and spaCy, CoreNLP also provides a good number of analytical capabilities to process and understand natural language.
API integration: CoreNLP has excellent API support for running it from the command line and programming languages like Python via a third-party API or web service.
Support for multiple operating systems (OSs): CoreNLP works on Windows, Linux, and macOS.
Language support: Like spaCy, CoreNLP provides useful language support, which includes Arabic, Chinese, and many more.
gensim
Topic modeling: It automatically extracts semantic topics from documents. It provides various statistical models, including latent Dirichlet allocation (LDA), for topic modeling.
Pretrained models: It has many pretrained models that provide out-of-the-box capabilities to develop general-purpose functionalities quickly.
Similarity retrieval: gensim’s capability to extract semantic structures from any document makes it an ideal library for similarity queries on numerous topics.
Features available in spaCy, NLTK, and CoreNLP
S.No. | Feature | spaCy | NLTK | CoreNLP |
---|---|---|---|---|
1 | Programming language | Python | Python | Java/Python |
2 | Neural network models | Yes | No | Yes |
3 | Integrated word vectors | Yes | No | No |
4 | Multi-language support | Yes | Yes | Yes |
5 | Tokenization | Yes | Yes | Yes |
6 | Part-of-speech tagging | Yes | Yes | Yes |
7 | Sentence segmentation | Yes | Yes | Yes |
8 | Dependency parsing | Yes | No | Yes |
9 | Entity recognition | Yes | Yes | Yes |
10 | Entity linking | No | No | No |
11 | Coreference resolution | No | No | Yes |
TextBlob
Sentiment analysis: It provides an easy-to-use method for computing polarity and subjectivity scores that measure the sentiment of a given text.
Language translations: Its language translation is powered by Google Translate, which provides support for more than 100 languages.
Spelling corrections: It uses a simple spelling correction method demonstrated by Peter Norvig on his blog at http://norvig.com/spell-correct.html . Norvig, a Director of Research at Google, reports that this approach is about 70% accurate.
fastText
Word embedding learning: Provides word embedding models using skip-gram and Continuous Bag of Words (CBOW), trained in an unsupervised fashion.
Word vectors for out-of-vocabulary words: It provides the capability to obtain word vectors even if the word is not present in the training vocabulary.
Text classification: fastText provides a fast text classifier which, per the authors’ paper “Bag of Tricks for Efficient Text Classification,” is often on par with deep learning classifiers in accuracy while being many orders of magnitude faster to train.
In the next few sections, you will see how to apply these tools to perform various tasks in NLP, NLU, and NLG.
Natural Language Processing
Language skills are considered among the most sophisticated tasks that a human can perform. Natural language processing deals with understanding and manipulating natural language text or speech to perform specific useful desired tasks. NLP combines ideas and concepts from computer science, linguistics, mathematics, artificial intelligence, machine learning, and psychology.
Mining information from unstructured textual data is not as straightforward as performing a database query using SQL. Categorizing documents based on keywords, identifying a mention of a brand in a social media post, and tracking the popularity of a leader on Twitter are all possible if we can identify entities like a person, organization, and other useful information.
The primary tasks in NLP are processing and analyzing written or spoken text by breaking it down, comprehending its meaning, and determining appropriate action. It involves parsing, sentence breaking, stemming, dependency tree, entity extraction, and text categorization.
We will see how words in a language are broken into smaller tokens and how various transformations work (transforming textual data into a structured and numeric value). We will also explore popular libraries like NLTK, TextBlob, spaCy, CoreNLP, and fastText.
Processing Textual Data
We will use the Amazon Fine Food Review dataset throughout this chapter for all demonstrations using various open-source tools. The dataset can be downloaded from www.kaggle.com/snap/amazon-fine-food-reviews , which is made available with a CC0: Public Domain license.
Reading the CSV File
As can be seen, the CSV contains columns like ProductID, UserID, Product Rating, Time, Summary, and Text of the review. The file contains almost 500K reviews for various products. Let’s sample some reviews to process.
Sampling
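The code listings are omitted here; a minimal sketch of reading and sampling the reviews with pandas might look like the following (a few inline toy rows stand in for the actual Kaggle download, and only a subset of the real dataset’s columns is used):

```python
import io
import pandas as pd

# Toy stand-in for Reviews.csv from the Amazon Fine Food Reviews dataset.
csv_data = """Id,ProductId,UserId,Score,Time,Summary,Text
1,B001E4KFG0,A3SGXH7AUHU8GW,5,1303862400,Good Quality Dog Food,I have bought several of the canned dog food products
2,B00813GRG4,A1D87F6ZCVE5NK,1,1346976000,Not as Advertised,Product arrived labeled as jumbo salted peanuts
3,B000LQOCH0,ABXLMWJIXXAIN,4,1219017600,Delight says it all,This is a confection that has been around a few centuries"""

reviews = pd.read_csv(io.StringIO(csv_data))
print(reviews.shape)  # (3, 7)

# Draw a reproducible random sample of reviews to process.
sample = reviews.sample(n=2, random_state=42)
print(sample["Text"].tolist())
```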
Tokenization Using NLTK
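A sketch of tokenizing a review with NLTK. The Treebank tokenizer is used here because it ships with NLTK and needs no extra data download; nltk.word_tokenize works similarly but requires the punkt resource:

```python
from nltk.tokenize import TreebankWordTokenizer

review = "This taffy is so good, soft and chewy."
tokens = TreebankWordTokenizer().tokenize(review)
print(tokens)
```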
Word Search Using Regex
Word Search Using the Exact Word
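The two kinds of search above can be sketched with Python’s built-in re module: a regex pattern that matches any word with a given prefix versus an exact whole-word match using word boundaries:

```python
import re

review = "The chips tasted stale, but the salsa was chip-worthy."

# Regex search: any word starting with "chip"
regex_hits = re.findall(r"\bchip\w*", review, flags=re.IGNORECASE)
print(regex_hits)  # ['chips', 'chip']

# Exact word search: match "salsa" only as a whole word
exact_hit = re.search(r"\bsalsa\b", review)
print(exact_hit.group())  # salsa
```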
NLTK
In this section, we will use many of the features from NLTK for NLP, such as normalization, noun phrase chunking, named entity recognition, and document classifier.
Normalization Using NLTK
In many natural language tasks, we often deal with the root form of words. For example, for the words “baking” and “baked,” the root word is “bake.” This process of extracting the root word is called stemming or normalization. NLTK provides two classes implementing stemming algorithms: the first is the Porter stemmer, and the second is the Lancaster stemmer.
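A quick sketch comparing the two stemmers (both ship with NLTK and need no data downloads):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["baking", "baked", "bakes"]:
    print(word, porter.stem(word), lancaster.stem(word))

# Porter reduces "baking"/"baked" to "bake"; Lancaster is more aggressive
# and may truncate further (e.g., "bak").
```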
Noun Phrase Chunking Using Regular Expressions
Named Entity Recognition
Once we have the POS of the text, we can extract the named entities. Named entities are definite noun phrases that refer to specific individuals such as ORGANIZATION and PERSON. Some other entities are LOCATION, DATE, TIME, MONEY, PERCENT, FACILITY, and GPE. FACILITY is any human-made artifact in the architecture and civil engineering domain, such as the Taj Mahal or the Empire State Building. GPE means geopolitical entities such as cities, states, and countries. We can extract all these entities using the ne_chunk() method in the nltk library.
spaCy
While spaCy offers all the features of NLTK, it is regarded as one of the best production grade tools for an NLP task. In this section, we will see how to use the various methods provided by the spaCy library in Python.
spaCy provides three core English models: en_core_web_sm (10MB), en_core_web_md (91MB), and en_core_web_lg (788MB). The larger models are trained on bigger vocabularies and hence give higher accuracy. So, depending on your use case, choose the model that fits your requirements.
POS Tagging
text: The original text
lemma: Token after stemming, which is the base form of the word
pos: Part of speech
tag: POS with details
dep: The relationship between the tokens. Also called syntactical dependency.
shape: The shape of the word (i.e., capitalization, punctuation, digits)
is_alpha: Returns True if the token consists of alphabetic characters
is_stop: Returns True if the token is a stopword like “at,” “so,” etc.
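The attributes above can be explored even without a trained model. A blank English pipeline fills in the lexical attributes (text, is_alpha, is_stop, shape_); lemma_, pos_, tag_, and dep_ are populated only after loading a trained model such as en_core_web_sm:

```python
import spacy

nlp = spacy.blank("en")  # tokenizer + English defaults, no trained components
doc = nlp("My dog had skin allergies at age 3.")

for token in doc:
    # shape_ maps letters/digits to x/X/d, e.g. "My" -> "Xx", "3" -> "d"
    print(token.text, token.is_alpha, token.is_stop, token.shape_)
```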
Dependency Parsing
text: Original noun chunk
root.text: Original word connecting the noun chunk to the rest of the noun chunk parse
root.dep: Dependency relation connecting the root to its head
root.head: Root token’s head
Dependency Tree
From the dependency trees, you can see that there are two compound word pairs, “English Bulldog” and “skin allergies,” and NUM “3” is the modifier of “age.” You can also see “summer” as the noun phrase as an adverbial modifier (npadvmod) to the token “had.” You can also observe many direct objects (dobj) of a verb phrase, which is a noun phrase, like (got, him) and (had, allergies) and object of a preposition (pobj) like (at, age). A detailed explanation of the relationships in a dependency tree can be found here: https://nlp.stanford.edu/software/dependencies_manual.pdf .
Chunking
Named Entity Recognition
spaCy reports an accuracy of 85.85% on named entity recognition (NER) tasks. The en_core_web_sm model exposes the recognized entities through the ents attribute of a processed document. The model is trained on the OntoNotes dataset, which can be found at https://catalog.ldc.upenn.edu/LDC2013T19 .
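Statistical NER requires a trained model such as en_core_web_sm. As a download-free sketch of the doc.ents API itself, the rule-based EntityRuler component can stand in for the statistical recognizer (the patterns below are illustrative, not part of any shipped model):

```python
import spacy

nlp = spacy.blank("en")
# Rule-based stand-in for the statistical NER, just to show the ents API.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Walmart"},
    {"label": "GPE", "pattern": "London"},
])

doc = nlp("I shopped at Walmart before flying to London.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Walmart', 'ORG'), ('London', 'GPE')]
```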
Types
TYPE | DESCRIPTION |
---|---|
PERSON | Names of people including fictional characters |
NORP | Nationalities or religious or political groups |
FAC | Civil engineering structures or infrastructures like buildings, airports, highways, bridges, etc. |
ORG | Organization names like companies, agencies, institutions, etc. |
GPE | A geopolitical entity like countries, cities, states |
LOC | Non-GPE locations like mountain ranges, water bodies |
PRODUCT | Objects, vehicles, foods, etc. (not services) |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART | Titles of books, songs, etc. |
LAW | Named documents made into laws |
LANGUAGE | Any named language |
DATE | Absolute or relative dates or periods |
TIME | Times smaller than a day |
PERCENT | Percentage, including % |
MONEY | Monetary values, including unit |
QUANTITY | Measurements, as of weight or distance |
ORDINAL | “first,” “second,” etc. |
CARDINAL | Numerals that do not fall under another type |
Pattern-Based Search
In the search span, if we want to find the word “Walmart,” we define this using the matcher.add method and pass pattern as the argument to the method.
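A sketch of this pattern-based search (a blank pipeline suffices, since the Matcher works directly on token attributes; assumes the spaCy 3.x Matcher.add signature, which takes a list of patterns):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One pattern: a single token whose lowercase form is "walmart".
matcher.add("WALMART", [[{"LOWER": "walmart"}]])

doc = nlp("I bought the snacks at Walmart, not at Target.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
# WALMART Walmart
```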
Searching for Entity
Training a Custom NLP Model
CoreNLP
CoreNLP is another popular toolkit for linguistic analysis such as POS tagging, dependency parsing, named entity recognition, sentiment analysis, and many others. We are going to use the CoreNLP features from Python through a third-party wrapper called stanfordcorenlp. It can be installed using pip from the command line or cloned from GitHub here: https://github.com/Lynten/stanford-corenlp .
Once you install or download the code, you need to specify the path to the Stanford-corenlp code from where it picks up the necessary model for the various NLP tasks.
Tokenizing
Part-of-Speech Tagging
POS tags can be extracted using the pos_tag method of the stanfordcorenlp wrapper.
Named Entity Recognition
Constituency Parsing
Constituency parsing extracts a constituency-based parse tree from a given sentence, representing its syntactic structure according to a phrase structure grammar. See Figure 5-13 for a simple example.
Dependency Parsing
TextBlob
TextBlob is a simple library well suited to beginners in NLP. Although it offers a few advanced features, like machine translation, these are delegated to a Google API. It is best for getting to know NLP use cases on generic datasets; for more sophisticated applications, consider using spaCy or CoreNLP.
POS Tags and Noun Phrase
Spelling Correction
Spelling correction is an exciting feature of TextBlob that is not provided by the other libraries described in this chapter. The implementation is based on the simple technique provided by Peter Norvig, which is only about 70% accurate. The correct method in TextBlob provides this implementation.
Machine Translation
Multilingual Text Processing
In this section, we will explore various libraries and their capabilities in handling languages other than English. We find spaCy one of the best in terms of the number of languages it supports, which currently stands at more than 50. We will perform language translation, POS tagging, entity extraction, and dependency parsing on text taken from the popular French news website www.lemonde.fr/ .
TextBlob for Translation
As shown in the example above, we use TextBlob for machine translation so non-French readers can understand the text we process.
POS and Dependency Relations
Its French POS tagging and dependency relation output is quite accurate: it identifies almost all the VERB, NOUN, ADJ, PROPN, and other tags. Next, let’s see how it performs on the entity recognition task.
Named Entity Recognition
Noun Phrases
Natural Language Understanding
Question answering
Natural language search
Web-scale relation extraction
Sentiment analysis
Text summarization
Legal discovery
Relation extraction: Finding the relationship between instances and database tuples. The outputs are discrete values.
Semantic parsing: Parse sentences to create logical forms of text understanding, which humans are good at performing. Again, the output here is a discrete value.
Sentiment analysis: Analyze sentences to give a score in a continuous range of values. A low value means a slightly negative sentiment, and a high score means a positive sentiment.
Vector space model: Create a representation of words as a vector, which then can help in finding similar words and contextual meaning.
We will explore some of the above applications in this section.
Sentiment Analysis
TextBlob provides an easy-to-use implementation of sentiment analysis. The method sentiment takes a sentence as an input and provides polarity and subjectivity as two outputs.
Polarity
A float value within the range [-1.0, 1.0]. The scoring uses a lexicon of positive, negative, and neutral words and detects the presence of each word in one of the three categories. In a simple scheme, a positive word is given a score of 1, a negative word -1, and a neutral word 0. The polarity of a sentence is then defined as the average score, i.e., the sum of the scores of each word divided by the total number of words in the sentence.
If the value is less than 0, the sentiment of the sentence is negative; if it is greater than 0, it is positive; otherwise, it is neutral.
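The averaging scheme described above can be sketched in a few lines of plain Python (a deliberately tiny, hypothetical lexicon for illustration; TextBlob’s actual implementation draws on the much richer pattern lexicon):

```python
# Toy polarity lexicon: +1 positive, -1 negative; absent words score 0.
LEXICON = {"good": 1, "great": 1, "tasty": 1, "bad": -1, "stale": -1}

def polarity(sentence: str) -> float:
    """Average per-word score: sum of word scores / number of words."""
    words = sentence.lower().split()
    return sum(LEXICON.get(w, 0) for w in words) / len(words)

print(polarity("the chips were good but a bit stale"))  # (1 - 1) / 8 = 0.0
print(polarity("great tasty chips"))                    # 2 / 3, i.e. positive
```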
Subjectivity
A float value within the range [0.0, 1.0]. A score of 1 means “very subjective.” Unlike polarity, which reveals the sentiment of the sentence, subjectivity does not express any sentiment; the score tends toward 1 when the sentence contains personal views or beliefs. The score for the entire sentence is calculated by assigning each word a subjectivity score and averaging, the same way as for polarity.
The TextBlob library internally calls the pattern library to calculate the polarity and subjectivity of a sentence. The pattern library uses SentiWordNet, which is a lexical resource for opinion mining, with polarity and subjectivity scores for all WordNet synsets. Here is the link to the SentiWordNet: https://github.com/aesuli/sentiwordnet .
Language Models
The first task of any NLP modeling is to break a given piece of text into tokens (or words), the fundamental unit of a sentence in any language. Once we have the words, we want to find the best numeric representation of the words because machines do not understand words; they need numeric values to perform computation. We will discuss two: Word2Vec (Word to a Vector) and GloVe (Global Vectors for Word Representation). For Word2Vec, a detailed explanation is provided in the next section.
Word2Vec
A skip-gram neural network model for Word2Vec computes the probability for every word in the vocabulary of being the “nearby word” that we select. Proximity or nearness of words can be defined by a parameter called window size. Figure 5-14 shows the possible pair of words for training a neural network with window size of 2.
The input sentence, “Building an enterprise chatbot that can converse like humans,” is broken into words, and with a window size of 2, we take up to two words each from the left and right of the input word. So, if the input word is “chatbot,” the output probability of the word “enterprise” will be high because of its proximity to “chatbot” within the window of size 2. This is only one example sentence. In a given corpus, we will have thousands of such sentences; the neural network will learn statistics from the number of times each pairing shows up. So, if we feed many more training samples like the one shown in Figure 5-14, it will figure out how likely the words “chatbot” and “enterprise” are to appear together.
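The windowing described above can be sketched directly (a hypothetical helper mirroring Figure 5-14’s window size of 2):

```python
def skipgram_pairs(tokens, window=2):
    """(input word, context word) pairs for every word in a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "building an enterprise chatbot that can converse like humans".split()
pairs = skipgram_pairs(tokens, window=2)
print(("chatbot", "enterprise") in pairs)  # True
```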
Neural Network Architecture
The input vector to the neural network is a one-hot vector representing the input word “chatbot”: it stores 1 in the ith position of the vector and 0 in all other positions, where 1 ≤ i ≤ n and n is the size of the vocabulary (the set of all unique words)
In the hidden layer, each one-hot vector of size n is multiplied by a weight matrix whose rows are feature vectors of size, let’s say, 1000. When training starts, these feature vectors are assigned random values. The multiplication selects the row of the n × 1000 matrix corresponding to the position where the one-hot vector has a value of 1.
So, if the vector representing “chatbot” is multiplied with the output vector representing “enterprise,” the softmax output will be close to 1, because in our corpus the two words appear together very frequently.
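The hidden-layer multiplication is effectively a row lookup, which a tiny numpy sketch makes concrete (a vocabulary of 5 and feature size 4 stand in for the n words and 1000 features in the text):

```python
import numpy as np

vocab = ["building", "an", "enterprise", "chatbot", "humans"]
n, dim = len(vocab), 4

rng = np.random.default_rng(0)
W = rng.random((n, dim))           # randomly initialized feature vectors

i = vocab.index("chatbot")
one_hot = np.zeros(n)
one_hot[i] = 1.0

hidden = one_hot @ W               # multiplying selects row i of W
print(np.allclose(hidden, W[i]))   # True
```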
Using the Word2Vec Pretrained Model
In the following code, we use a pretrained Word2Vec model from the popular Python library gensim. Word2Vec models provide a vector representation of words that makes various natural language tasks possible, such as identifying similar words, finding synonyms, word arithmetic, and many more. Popular word-embedding approaches include Word2Vec’s CBOW and skip-gram architectures as well as GloVe. In this section, we will use these models to perform various NLU tasks.
In the demo, we use the model to perform many syntactic/semantic NLU word tasks.
review_texts: Input vocabulary to the neural network (NN).
size: The size of the NN layers, which corresponds to the degrees of freedom the algorithm has. A bigger network is usually more accurate, provided there is a sizeable dataset to train on. The suggested range is from tens to thousands.
min_count: This argument helps in pruning the less essential words from the vocabulary, such as words that appeared once or twice in the corpus of millions of words.
workers: The number of worker threads Word2Vec uses for training parallelization, which speeds up the training process considerably. As per the official gensim docs, you need to install Cython in order to run in parallelized mode.
Note
After installing Cython, you can run the following code to check if you have the FAST_VERSION of word2vec installed.
Performing Out-of-the-Box Tasks Using a Pretrained Model
One of the useful features of gensim is that it offers several pretrained word vectors from gensim-data. Apart from Word2Vec, it also provides GloVe, another robust unsupervised learning algorithm for finding word vectors. The following code downloads a glove-wiki-gigaword-100 word vector from gensim-data and performs some out-of-the-box tasks.
Step 2: Compute the nearest neighbors. As you have seen, a word vector is an array of numbers representing a word. It is therefore possible to perform mathematical computations on the vectors; for example, we can compute the Euclidean distance or cosine similarity between any two word vectors. This yields some interesting results. The following code shows some of the outcomes.
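As a self-contained sketch of the two similarity computations themselves (toy numpy vectors stand in for the pretrained GloVe vectors):

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    return float(np.linalg.norm(u - v))

apple = np.array([0.8, 0.1, 0.3])   # toy stand-ins for real word vectors
orange = np.array([0.7, 0.2, 0.3])

print(cosine_similarity(apple, orange))  # close to 1: similar directions
print(euclidean_distance(apple, orange))
```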
Figure 5-14 shows an example of how the input data for training the neural network was created by shifting a window of size 2. In the following example, you will see that “apple” on the Internet is no longer a fruit; it has become synonymous with Apple the company, and many similar companies show up when we compute words similar to “apple.” The reason for this similarity is the vocabulary used for training, which in this case is a Wikipedia dump of close to 6 billion uncased tokens. More such pretrained models are available at https://github.com/RaRe-Technologies/gensim-data .
Step 3: Identify linear substructures. The relatedness of two words is easy to compute using a similarity or distance measure, whereas capturing the nuances of a word pair or of sentences in a more qualitative way requires vector operations. Let’s see the methods that the gensim package offers to accomplish this task.
Word Pair Similarity
Sentence Similarity
We can also find the distance or similarity between two sentences. gensim offers a measure called Word Mover’s Distance, which has proved to be quite a useful tool for finding the similarity between two documents containing many sentences. The lower the distance, the more similar the two documents. Under the hood, Word Mover’s Distance uses the word embeddings generated by the Word2Vec model to first capture the concept of the query sentence (or document) and then find all similar sentences or documents. For example, the Word Mover’s distance between two unrelated sentences is high compared to that between two sentences that are contextually related.
Arithmetic Operations
Even more impressive is the ability to perform arithmetic operations like addition and subtraction on word vectors to obtain a form of linear substructure. In the first example, we compute woman + king − man, and the word most similar to the result is queen. The underlying concept is that man and woman are genders, which may be equivalently specified by other word pairs such as king and queen. Hence, when we subtract man from the sum of woman and king, the word we obtain is queen. The GloVe project provides a few examples here: https://nlp.stanford.edu/projects/glove/ .
Odd Word Out
The model can find words that are out of context in a given sequence of words. The doesnt_match method computes the center point as the mean of all the word vectors in a given list and finds each word’s cosine distance from that center. The word with the highest cosine distance is returned as the odd word that does not fit in the list.
Language models like Word2Vec and GloVe are compelling in generating meaningful relationships between words, which comes naturally to a human because of our understanding of languages. It is an excellent accomplishment for machines to be able to perform at this level of intelligence in understanding the use of words in various syntactic and semantic forms.
fastText Word Representation Model
Similar to the examples discussed in this section, using either the skip-gram or CBOW model, various tasks can be performed. We can evaluate the performance to choose the best model for our final implementation.
Information Extraction Using OpenIE
The Open Information Extractor (OpenIE) annotator extracts open-domain relation triples representing subject, predicate, and object, often called a triplet. OpenIE can be a useful tool when there is minimal training data available.
The Possible Triplets from the Example Sentence Using OpenIE
S.No | Subject | Predicate | Object |
---|---|---|---|
1 | Narendra Modi | is | politician serving as 14th Prime Minister |
2 | Narendra Modi | is | Indian politician serving as 14th Prime Minister |
3 | Narendra Modi | is | politician serving as Prime Minister |
4 | Narendra Modi | is | Politician |
5 | Modi | is | Indian |
6 | Narendra Modi | is | Indian politician serving as 14th Prime Minister of India |
7 | Narendra Modi | is | Indian politician serving as Prime Minister |
8 | Narendra Modi | is | Indian politician serving as Prime Minister of India since 2014 |
9 | Narendra Modi | is | Indian politician serving as Prime Minister since 2014 |
10 | Narendra Modi | is | politician serving as Prime Minister of India since 2014 |
11 | Narendra Modi | is | politician serving as 14th Prime Minister of India since 2014 |
12 | Narendra Modi | is | politician serving as 14th Prime Minister since 2014 |
13 | Narendra Modi | is | Indian politician serving as 14th Prime Minister since 2014 |
14 | Narendra Modi | is | politician serving as Prime Minister of India |
15 | Narendra Modi | is | politician serving as Prime Minister since 2014 |
16 | Narendra Modi | is | Indian politician |
17 | Narendra Modi | is | Indian politician serving as Prime Minister of India |
18 | Narendra Modi | is | Indian politician serving since 2014 |
19 | Narendra Modi | is | politician serving as 14th Prime Minister of India |
20 | Narendra Modi | is | politician serving since 2014 |
21 | Narendra Modi | is | Indian politician serving as 14th Prime Minister of India since 2014 |
Topic Modeling Using Latent Dirichlet Allocation
Topic modeling is one of the typical applications of understanding natural language. Given a collection of documents, we can draw an “abstract topic” that represents all the docs in the collection. Latent Dirichlet allocation (LDA) is a favorite statistical model used for topic modeling. It helps in discovering the semantic structures in a given text.
In this section, for a demonstration, we will use three example reviews from the Amazon Fine Food review dataset to train an LDA model. We will see one other example of topic modeling using additional tools like spaCy, NLTK, and gensim in the “Applications” sections.
Collection of Documents
Loading Libraries and Defining Stopwords
Removing Common Words and Tokenizing
Removing Words That Appear Infrequently
Now we see the words that occur more than once. For our example, it seems like there are not many words with more than one occurrence. We expect the model not to perform very well. However, let’s still go ahead with training the model.
Saving the Training Data as a Dictionary
Generating the Bag of Words
Training the Model Using LDA
Finally, using the bag-of-words dictionary of words, we train the latent Dirichlet allocation model. LDA is a generative statistical model: given observed variables X (the words) and latent variables Y (the topics), it models the joint probability P(X, Y). LDA is a favorite machine learning model widely used in topic modeling. Each document (in our example, each review) is modeled as a mixture of various topics, and LDA assigns a set of topics to each document.
The gensim library provides a method called LsiModel(), which we use here for training. Strictly speaking, LsiModel() trains a latent semantic indexing (LSI) model, a technique closely related to LDA that originated in information retrieval; gensim's LdaModel() trains LDA proper.
- 1.
Topic 1: 0.556*"it" + 0.542*"tasted" + 0.428*"as" + 0.328*"chip" + 0.328*"corn" + 0.000*"i"
- 2.
Topic 2: -0.804*"tasted" + 0.528*"it" + 0.190*"corn" + 0.190*"chip" + 0.041*"as" + 0.000*"i"
Note that a more accurate model would need plenty of training data, and with it, many more interesting topics might emerge.
Natural Language Generation
Natural language generation is a subfield of NLP and computational linguistics concerned with producing understandable, human-readable text in various languages. The ability to use language representations and domain knowledge to produce documents, explanations, help messages, reports, and even poems makes NLG one of the most actively researched areas today.1 In the future, NLG will play a vital role in human-computer interfaces.
The significant difference between NLU and NLG is that NLU maps sentences into internal semantic representations (called parsing in NLU systems), whereas NLG maps semantic representations into surface sentences (called realization in NLG systems). Both types of mapping can be achieved through a bidirectional grammar, which uses a declarative representation of a language's grammar.
We will demonstrate NLG applications using Python- and Java-based libraries like markovify and SimpleNLG. We will also use a deep learning model for text generation. Such deep learning models are behind the popular use cases where machines write poems or generate musical notes, given a sizeable corpus of data.
Automating the documentation of code and procedures
Generating reports from financial data or annual reports
Summarizing graphical reports and numbers from tabular data
Generating discharge summaries and pathology reports
Helping meteorologists compose weather forecasts
There are many more use cases that are evolving quickly, especially with the emerging sophistication of deep learning algorithms and increasing computation power of machines.
Markov Chain-Based Headline Generator
A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Markov chains statistically model random processes: they are defined by states and transition probabilities, where a process moves from one state to another based on preset probability values.
Markov chain models are mathematically robust and give superior results if modeled correctly. Unlike many machine learning algorithms that can be applied in a brute-force fashion, Markov chains need a diligent design to model a stochastic process well.
Computer simulation of numerous real-world phenomena such as weather modeling, stock market fluctuations, and water flow in a dam
Biological modeling like population processes
Algorithmic music composition
Modeling board games like Snakes and Ladders or Hi Ho! Cherry-O
Population genetics to describe changes in gene frequencies in small populations affected by genetic drift
Let’s use the markovify library from Python to generate some headlines.
Loading the Library
Loading the File and Printing the Headlines
Building a Text Model Using Markovify
Generating Random Headlines
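The markovify steps above boil down to building a word-level transition table and sampling from it. A minimal stdlib sketch of the same idea (the toy headlines are invented for illustration):

```python
import random

# Invented headlines standing in for the real training file
headlines = [
    "stocks rally as markets open higher",
    "stocks fall as investors worry",
    "markets open lower on trade fears",
]

# Transition table: word -> list of observed next words (duplicates keep counts)
transitions = {}
for line in headlines:
    words = ["<s>"] + line.split() + ["</s>"]
    for cur, nxt in zip(words, words[1:]):
        transitions.setdefault(cur, []).append(nxt)

def generate(seed=None):
    """Walk the chain from <s> until </s>, choosing each next word at random."""
    rng = random.Random(seed)
    word, out = "<s>", []
    while True:
        word = rng.choice(transitions[word])
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(generate(0))
```

markovify does the same with longer state sizes (bigrams by default) and extra safeguards against reproducing the training sentences verbatim.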
SimpleNLG
Orthography: This refers to the conventions for writing a language. It includes capitalization, whitespace within sentences and paragraphs, punctuation, emphasis, and hyphenation.
Morphology: The study of words, their formation, and relationship with other words in the same language. It analyzes the structure of words and parts of words, such as stems, root words, prefixes, and suffixes.
Simple grammar: Ensures grammatical correctness like noun-verb agreement and creating well-formed verb groups (e.g. “does not play”).
In NLG terminology, SimpleNLG is a realizer for a simple grammar. It can be useful for creating documentation and reports that need grammatically correct sentences. The demonstration in this section uses nglib, a Python library that is mainly a wrapper around SimpleNLG.
Loading the Library
Tense
Negation
Interrogative
Complements
Modifiers
Prepositional Phrases
Coordinated Clauses
Subordinate Clauses
Main Method
Printing the Output
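For intuition about what a realizer does, here is a toy, hand-rolled sketch of tense, negation, and interrogative realization. This is not the SimpleNLG or nglib API; real realizers use proper morphology tables rather than these naive string rules:

```python
# Toy realizer (NOT the SimpleNLG/nglib API): naive string rules standing in
# for SimpleNLG's morphology and grammar machinery
def realize(subject, verb, obj, tense="present", negated=False, interrogative=False):
    if interrogative:
        aux = "does" if tense == "present" else "did"
        return f"{aux} {subject} {verb} {obj}?".capitalize()
    if negated:
        vp = ("does not " if tense == "present" else "did not ") + verb
    elif tense == "past":
        vp = verb + "ed"          # naive inflection; real verbs are often irregular
    else:
        vp = verb + "s"           # naive 3rd-person singular
    return f"{subject} {vp} {obj}.".capitalize()

print(realize("mary", "chase", "the monkey"))
print(realize("mary", "chase", "the monkey", negated=True))
print(realize("mary", "chase", "the monkey", interrogative=True))
```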
As you can see, SimpleNLG offers an easy-to-use syntax for programmatically generating grammatically correct sentences in English. Next, let's dive into a deep learning model that generates the next words given a piece of text. Unlike with SimpleNLG, we are not sure that such a deep learning model will produce grammatically correct sentences.
Deep Learning Model for Text Generation
Text generation using deep learning underlies language models and applications like speech-to-text, conversational chatbots, and text summarization. Such language models predict the occurrence of a word based on the previous sequence of words. Many deep learning network architectures, such as recurrent neural networks (RNNs), are available for language modeling.
RNNs are deployed in a variety of applications like speech recognition, language modeling, translation, image captioning, and many more. Figure 5-17 shows how the hidden layers in RNNs are stacked up in a chain-like sequence. The rolled and unrolled versions help in understanding how the internal processing happens.
Although RNNs are capable of picking up such long-term dependencies in sentences, they require a careful selection of parameters, which is often difficult in many practical problems. This is where LSTMs come to the rescue.
Cell state: The line that runs through the top of the cell with only a few direct interactions, such as pointwise multiplication and addition, which can add information to or remove information from the cell state.
Forget gate layer: Gates are a mechanism by which LSTMs control how much information should be passed through the cell state. Here a sigmoid function is used, which has an output value between 0 and 1. A value of 1 means let everything pass; 0 means let nothing pass.
Input gate layer: The sigmoid layer called an input gate layer decides which values we will update.
Tanh layer: The tanh activation function layer creates a vector of new candidate values given the input and hidden state values from the previous time step.
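The four components above can be summarized as one LSTM step. A scalar, stdlib-only sketch (the shared toy weights w, u, and b are an illustration; real LSTMs learn separate weight matrices for each gate):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=0.5, u=0.5, b=0.0):
    """One scalar LSTM step; real LSTMs use separate learned weights per gate."""
    z = w * x + u * h_prev + b
    f = sigmoid(z)                # forget gate: 0 = drop old state, 1 = keep it
    i = sigmoid(z)                # input gate: how much new candidate to admit
    c_tilde = math.tanh(z)        # tanh layer: candidate cell-state values
    c = f * c_prev + i * c_tilde  # updated cell state
    o = sigmoid(z)                # output gate
    h = o * math.tanh(c)          # new hidden state
    return h, c

h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0)
print(h, c)
```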
Loading the Library
Defining the Training Data
Data Preparation
- 1.
Convert the input review text into lowercase and split the review into sentences at the newline character, \n. The split function created three sentences in the corpus. The following is the result of the operation:
- 2.
Tokenize the input reviews from the dataset using the Keras fit_on_texts() method. The method internally represents the words in a dictionary, with each word getting an index based on the frequency of its occurrence. So, if the word "the" appears most often in our review text, it gets the lowest index value, i.e., word_index["the"] = 1 (Keras reserves index 0 for padding). In our review, except for the words "the" and "to," all other words appear just once. The following is the output of the operation:
- 3.
Transform each word in the review into a sequence of integers. Each word gets the integer value corresponding to the index obtained using fit_on_texts(). The following is the output of the operation:
- 4.
Generate n-gram sequences using the integer sequence for each sentence in the corpus. In each iteration of the for loop, the list input_review_sequences gets updated. In the final output, all possible n-gram prefixes of length 2 to len(token_list) are generated.
- 5.
Pad the sequences. Since the n-gram sequences differ in length, the matrix computation in the neural network would not be possible, so each sequence is left-padded with 0 to make them equal in length. For example, the first sequence in the list, [3, 4], is padded as [0 0 0 0 0 0 0 0 0 0 0 0 3 4]. The following is a view of the matrix after padding:
- 6.
Set the last word as the label for each n-gram sequence. For example, in the n-gram sequence [3, 4] corresponding to the words ["chilling," "in"], the label is "in." Similarly, in the n-gram sequence [3, 4, 1] corresponding to the words ["chilling," "in," "the"], the label is "the." Since the model predicts the next possible word as part of the text generation process, the predictor-and-label pairs help the neural network learn which word is more likely to occur after a given sequence of words. The following code extracts the label for each n-gram sequence in the matrix above; observe that each label is the last integer in each row:
predictors, label = input_review_sequences[:,:-1], input_review_sequences[:,-1]
print(label)
[ 4 1 5 6 2 7 1 8 9 10 12 13 14 15 16 17 2 18 19 20 21 22 23 25 26 27 28 29 30 31 1 32 33 34]
- 7.
As a final step in the preprocessing, we convert each label into a one-hot encoded vector to make it feasible for matrix computation in the neural network training. to_categorical() is a method from the keras.utils library. Here is the output:
label = ku.to_categorical(label, num_classes=total_words)
print(label)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 1.]]
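Steps 4 through 6 can be sketched with plain Python lists (the integer-encoded corpus below is a toy stand-in for the tokenized reviews):

```python
# Toy integer-encoded sentences standing in for the tokenized reviews
corpus = [[3, 4, 1, 5], [6, 2, 7]]

# Step 4: all n-gram prefixes of length 2..len(sentence)
input_review_sequences = []
for token_list in corpus:
    for i in range(2, len(token_list) + 1):
        input_review_sequences.append(token_list[:i])

# Step 5: left-pad with 0 so every sequence has equal length
max_len = max(len(s) for s in input_review_sequences)
padded = [[0] * (max_len - len(s)) + s for s in input_review_sequences]

# Step 6: the last integer of each row is the label, the rest are predictors
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]
print(padded)
print(labels)
```

In the real pipeline, Keras's pad_sequences() performs the padding and NumPy slicing splits predictors from labels, but the shapes produced are exactly these.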
Creating an RNN Architecture Using an LSTM Network
- 1.
Embedding: A dense vector representation for each word index. The fixed integers of the predictor are converted into randomly initialized dense vectors; for example, [3, 4] could become [[0.26, 0.14], [0.2, -0.4]]. The dimension of the dense vector is given by the second argument, output_dim, to the Embedding layer in Keras. The first argument, input_dim, is the total number of words in the reviews. The argument input_length is set to the maximum sequence length minus 1.
- 2.
LSTM: The long short-term memory layer takes units as the dimensionality of its output space. The activation function is tanh by default, and the recurrent activation function is a hard sigmoid by default. Other available activation functions include softmax and the Rectified Linear Unit (ReLU), but with LSTMs it is recommended to stick with tanh and sigmoid.
- 3.
Dropout: RNNs have a tendency to overfit the data. The Dropout layer in Keras randomly sets a fraction of the input units to 0 based on the value of the rate argument. In the example, the rate is set to 0.1, which means 10% of the input units are randomly dropped.
- 4.
Dense: The Dense layer creates a regular, densely connected neural network layer. It serves as the output layer, where a softmax activation function is applied to give values between 0 and 1. The word with a value close to 1 is the most probable next word in the sequence given the input predictor.
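Putting the four layers together, a hedged sketch of the described architecture using the Keras Sequential API (the vocabulary size and layer dimensions are toy values, not the book's exact configuration; input_length is omitted here since recent Keras versions infer sequence length from the data):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

total_words = 35   # toy vocabulary size for illustration

model = Sequential([
    Embedding(input_dim=total_words, output_dim=10),  # dense word vectors
    LSTM(units=100),                                  # tanh/sigmoid by default
    Dropout(rate=0.1),                                # drop 10% of units
    Dense(total_words, activation="softmax"),         # distribution over next word
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
print(len(model.layers))
```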
Defining the Generate Text Method
Training the RNN Model
Generating Text
Applications
Topic modeling using the spaCy, NLTK, and gensim libraries: This is an extension of the topic modeling we performed using LDA earlier in the chapter. In this demonstration, we will use the combined knowledge of spaCy, NLTK, and gensim to perform various tasks in topic modeling.
Classify gender (male or female) from a person's name: Using features like the last letter of a name and a corpus of male and female names, we will classify names as male or female. This might help in filtering through the reviews and identifying any gender-based distinctions in the reviews for a product.
Classify a given document into a category: We classify a review as positive or negative. We will use the NLTK library to perform the preprocessing and a Naïve Bayes classifier for the classification.
Intent classification and question answering: In this application, we will build an intent classifier and a context-based question-answering utility that could be integrated into any chatbot application. We will use pretrained deep learning models from the DeepPavlov library in Python.
Topic Modeling Using spaCy, NLTK, and gensim Libraries
In this demonstration, we will use spaCy for tokenizing the review text, NLTK for lemmatization and preprocessing, and the LDA model from gensim for training.
Tokenizing and Cleaning the Text
- 1.
Detect URLs and screen names, and append them separately into the lda_review_tokens list. This ensures the URLs and screen names are not processed further.
- 2.
Convert the rest of the tokens into lowercase.
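A sketch of these two cleaning steps using the standard library's re module (the function body and sample text are illustrative; lda_review_tokens is the list named in the text above):

```python
import re

def clean_tokens(text):
    """Keep URLs and @screen-names aside untouched; lowercase everything else."""
    lda_review_tokens = []
    for tok in text.split():
        if re.match(r"https?://\S+", tok) or tok.startswith("@"):
            lda_review_tokens.append(tok)      # no further processing
        else:
            lda_review_tokens.append(tok.lower())
    return lda_review_tokens

print(clean_tokens("Loved it @FoodCo see https://example.com REALLY tasty"))
```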
Lemmatization
Preprocessing the Text Method for LDA
- 1.
Remove all the stopwords in the English vocabulary. We need to download the dataset named stopwords before we can check the tokens against it.
- 2.
Extract the lemma for each token after removing the stopwords.
Reading the Training Data
Bag of Words
Training and Saving the Model
From the output above, it looks like topics 0 and 3 are about a "ginger flavor corn syrup," while topics 2 and 4 are not very clear about what they convey. Topic 1 talks about "tortilla chips."
Predictions
Gender Identification
In this application, we use a corpus of male and female names to build a model for predicting gender from a given name. It is a simple model whose only feature is the last letter of the name. The core idea is that female and male names generally show certain distinctive features; for example, most female names end with a, e, or i. We use the NLTK library to build this model.
Loading the NLTK Library and Downloading the Names Corpus
Loading the Male and Female Names
Common Names
Extract Features
Randomly Splitting into Train and Test
Training the Model
Model Prediction
Model Accuracy
Most Informative Features
Using the show_most_informative_features() method of the model, we can see which last letters of the names are most informative for classifying male and female names.
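The whole pipeline can be miniaturized in stdlib Python. The toy name lists below stand in for the NLTK names corpus, and a simple count-based rule stands in for NLTK's Naive Bayes classifier:

```python
from collections import Counter, defaultdict

# Toy name lists standing in for the NLTK names corpus
male = ["john", "mike", "robert", "david", "mark"]
female = ["mary", "linda", "susan", "julia", "anna"]

def gender_features(name):
    return name[-1]                  # the single feature: last letter

# Count last-letter occurrences per class (stand-in for Naive Bayes training)
counts = defaultdict(Counter)
for name in male:
    counts[gender_features(name)]["male"] += 1
for name in female:
    counts[gender_features(name)]["female"] += 1

def classify(name):
    c = counts[gender_features(name)]
    return "female" if c["female"] >= c["male"] else "male"

print(classify("amanda"))
print(classify("mark"))
```

With the full NLTK corpus and a proper Naive Bayes model, the counts become smoothed probabilities, but the feature and the intuition are the same.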
Document Classification
A common task in NLP is tagging a document (which could also be a collection of sentences) with a specific category. An example is a news aggregator classifying articles into political, sports, and business. Such classification is useful when there is an enormous amount of unstructured textual data and no manual labor available to tag it. An automatic document classifier can fast-track the tagging process. Another domain where it's useful is classifying movie and product reviews into positive and negative sentiment.
Loading Libraries
Reading the Dataset into the Categorized Corpus
Computing Word Frequency
Checking the Presence of Frequent Words
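A stdlib sketch of this feature-extraction idea: the most frequent corpus words become boolean contains(word) features, as in NLTK's document classification recipes. The labeled reviews below are invented for illustration:

```python
from collections import Counter

# Invented labeled reviews standing in for the categorized corpus
reviews = [
    ("great product and good taste", "pos"),
    ("good flavor would buy again", "pos"),
    ("not good at all very stale", "neg"),
    ("did not like the taste", "neg"),
]

# The most frequent corpus words become the feature set
all_words = Counter(w for text, _ in reviews for w in text.split())
word_features = [w for w, _ in all_words.most_common(5)]

def document_features(text):
    """Boolean presence/absence of each frequent word."""
    words = set(text.split())
    return {f"contains({w})": (w in words) for w in word_features}

print(document_features("not a good product"))
```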
Training the Model
Most Informative Features
In this corpus, a review that mentions "not" is almost five times more likely to be negative than positive, while a review that mentions "good" is only about three times more likely to be negative than positive. Perhaps the negative skew of the word "good" stems from reviews of the nature "the product is good but ...," where customers have one or two complaints.
If we add more positive and negative reviews to this corpus, the accuracy will start to improve.
Intent Classification and Question Answering
The two most important NLU tasks a chatbot should perform well are classifying the intent of a given user query and answering questions by understanding the context. While there are many proprietary frameworks for these two tasks, they don't provide visibility into what happens behind the scenes. In this section, we will use a Python library called deeppavlov. It's an open-source deep learning library for end-to-end dialog systems and chatbots, and it provides many pretrained deep learning models as part of its offering.
Intent Classification
We need to classify a given query (input from the user) into an intent class. Once an intent class is identified, a chatbot can trigger the respective logic as a response to a user query. For example, if the query is “how is the weather today,” the intent classification should trigger the weather services API from within the chatbot and fetch the result.
GetWeather
BookRestaurant
PlayMusic
AddToPlaylist
RateBook
SearchScreeningEvent
SearchCreativeWork
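Once an intent label comes back from the classifier, the chatbot dispatches to the matching service, as the GetWeather example above describes. A toy sketch of that dispatch (the handler names and return strings are hypothetical, not part of any real API):

```python
# Hypothetical handlers; in a real chatbot these would call actual services
def get_weather(query):
    return "weather-service called"

def book_restaurant(query):
    return "booking-service called"

HANDLERS = {
    "GetWeather": get_weather,
    "BookRestaurant": book_restaurant,
}

def respond(intent, query):
    """Route the classified intent to its handler, with a safe fallback."""
    handler = HANDLERS.get(intent)
    return handler(query) if handler else "fallback response"

print(respond("GetWeather", "how is the weather today"))
```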
Setting tensorflow as the Back End
Building the Model
Classifying the Intent
You can train a custom model to classify the intent for a specific use case. More details on training a custom model can be found at http://docs.deeppavlov.ai/en/latest/components/classifiers.html#how-to-train-on-other-datasets. Training a custom model is a resource-intensive process, so if you are trying to build a generic chatbot, we suggest you first explore all the pretrained models listed at http://docs.deeppavlov.ai/en/latest/components/classifiers.html#pre-trained-models before deciding to build your own.
Question Answering
Chatbots often need to understand the context of the conversation to answer a particular query from a user. The deeppavlov library provides a model pretrained on the Stanford Question Answering Dataset (SQuAD), a reading comprehension dataset consisting of crowdsourced questions on a set of Wikipedia articles. More details on the dataset can be found at https://rajpurkar.github.io/SQuAD-explorer/.
The main task of the model trained on the SQuAD dataset is to take a given context and answer a question within that context.
Building the Model
Context and Question
Serving the DeepPavlov Model
By the end of the above command, we should see the following output, where a Flask app is created and the API is running on the local host. You can specify your own port and URL for hosting the API. More on this can be found at http://docs.deeppavlov.ai/en/latest/devguides/rest_api.html.
In the next chapter, we will introduce our enterprise chatbot named IRIS, from which we can directly call the above REST API for intent classification. Note that you still have to train your own model on private enterprise data in order to integrate it with the chatbot. Even though we will build IRIS using a Java framework, the REST API we created above can easily be called from within a Java application. We can create many applications using the powerful Python libraries for NLP, NLU, and NLG tasks and simply host all of them behind a REST API, which is language and platform agnostic.
Summary
We started by identifying the differences between natural language processing, understanding, and generation, and then discussed various open source tools available to process and understand natural languages.
Then we delved into NLP, where we showed how to use tools like NLTK, spaCy, CoreNLP, gensim, and TextBlob for various tasks such as processing textual data, normalizing text, part-of-speech tagging, dependency parsing, spelling correction, machine translation, and named entity recognition.
In the NLU section, we showed language models like Word2Vec and GloVe for performing out-of-the-box tasks such as word and sentence similarity, finding linear substructures between words, and performing arithmetic operations on word embedding vectors to find meaningful semantic relationships between words. As an important part of NLU, we explored relationship extraction from a given sentence using the OpenIE tool and built a topic modeling tool using latent Dirichlet allocation (LDA).
We then moved into NLG, where we explored use cases like a random headline generator using the markovify library in Python. Then we explored SimpleNLG, an English grammar-based natural language generation utility, which offers grammatical operations such as generating the past tense, negation, complements, and prepositional phrases. Finally, in the NLG section we built a deep learning-based model for predicting the next word in a given phrase or sentence, using a popular deep learning architecture called long short-term memory.
In the final part, we covered applications of NLP and NLU: topic modeling, gender identification, document classification, intent classification, and question answering. In the topic modeling application, we utilized all of the open source tools available from the previous sections of the chapter.
Overall, in this chapter we explored the P-U-G of natural languages extensively. The availability of many open source tools in Python and Java facilitated a great number of demonstrations for understanding and modeling natural languages. We covered a wide range of topics, from parsing text data to building generative models using deep learning. Our aim with this chapter was to provide an exhaustive collection of methods and tools to empower you to build chatbots with basic and advanced levels of natural language processing, understanding, and generation capabilities.
Next, we will build and deploy a fully functional in-house enterprise chatbot on private datasets. Since there are many chatbot frameworks with support for NLP and NLU, the methods discussed in this chapter might at first seem not so readily usable; however, under the hood, many frameworks like RASA and LUIS internally use the techniques discussed in this chapter. Also, many ideas from NLG are still not available in any standard chatbot framework, so they are often built from scratch. We believe the ideas taught in this chapter will come in handy when you build an enterprise chatbot.