The human brain is one of the most advanced machines when it comes to processing, understanding, and generating (P-U-G) natural language. The capabilities of the human brain stretch far beyond just being able to perform P-U-G on one language, dialect, accent, and conversational undertone. No machine has so far reached the human potential of performing all three tasks seamlessly. However, the advances in machine learning algorithms and computing power are making the distant dream of creating human-like bots a possibility.
NLP, NLU, and NLG
Type | NLP | NLU | NLG |
---|---|---|---|
Brief | Process and analyze written or spoken text by breaking it down, comprehending its meaning, and determining the appropriate action. It involves parsing, sentence breaking, and stemming. | A specific type of NLP that helps to deal with reading comprehension, which includes the ability to understand meaning from its discourse content and identify the main thought of a passage. | NLG is one of the tasks of NLP to generate natural language text from structured data from a knowledge base. In other words, it transforms data into a written narrative. |
Functions | Identify part of speech, text categorizing, named entity recognition, translation, speech recognition | Automatic summarization, semantic parsing, question answering, sentiment analysis | Content determination, document structuring, generating text in interactive conversation |
Real-World Application | Article classification for digital news aggregation company | Building a Q&A chatbot, brand sentiment using Twitter and Facebook data | Generating a product description for an e-commerce website or a financial portfolio summary |
Chatbot Architecture
1. Customer says, “Help me book a flight for tomorrow from London to New York” through the airline’s Facebook page. In this case, Facebook becomes the presentation layer. A fully functional chatbot could be integrated into a company’s website, social network page, and messaging apps like Skype and Slack.
2. Next, the message is carried to the messaging backend, where the plain text passes through an NLP/NLU engine that breaks the text into tokens and converts the message into a machine-understandable command. We will revisit this in greater detail throughout this chapter.
3. The decision engine then matches the command with preconfigured workflows. For example, to book a flight, the system needs a source and a destination. This is where NLG helps: the chatbot asks, “Sure, I will help you book your flight from London to New York. Could you please let me know if you prefer your flight from Heathrow or Gatwick Airport?” The chatbot picks up the source and destination and automatically generates a follow-up question asking which airport the customer prefers.
4. The chatbot now hits the data layer and fetches the flight information from pre-fed data sources, which are typically connected to live booking systems. The data source provides flight availability, price, and many other services as per the design.
Some chatbots rely heavily on generative responses, while others are built for retrieving information and fitting it into a predesigned conversational flow. For example, in the flight booking use case, we know almost all the possible ways a customer could ask to book a flight, whereas for a telemedicine company’s chatbot, we cannot anticipate all the possible questions a patient could ask. So the telemedicine chatbot needs the help of generative models built using NLG techniques, whereas for the flight booking chatbot, a good retrieval-based system with an NLP/NLU engine should work.
Since this book is about building an enterprise chatbot, we will focus more on the applications of P-U-G in natural languages rather than going deep into the foundations of the subject. In the next section, we’ll show various techniques for NLP and NLU using some of the most popular tools in Python. There are other Java- and C#-based libraries; however, Python libraries enjoy more significant community support and enable faster development.
Popular Open Source NLP and NLU Tools
In this section, we will briefly explore various open source tools available to perform natural language processing, understanding, and generation. While these tools do not draw strict boundaries between the P-U-G of natural language, we will demonstrate their capabilities under the three corresponding headings.
NLTK
Classification of text: Classifying text into different categories for better organization and content filtering
Tokenization of sentences: Breaking sentences into words for symbolic and statistical natural language processing
Stemming words: Reducing words into base or root form
Part-of-speech (POS) tagging: Tagging the words into POS, which categorizes the words into similar grammatical properties
Parsing text: Determining the syntactic structure of text based on the underlying grammar
Semantic reasoning: Ability to understand the meaning of the word to create representations
NLTK is the first-choice tool for teaching NLP. It is also widely used as a platform for prototyping and research.
spaCy
Most organizations that build products involving natural language data are adopting spaCy. It stands out by offering a production-grade NLP engine that is accurate and fast, and its extensive documentation further increases the adoption rate. It is developed in Python and Cython. All the language models in spaCy are trained using deep learning, which provides high accuracy for NLP tasks.
Covers NLTK features: Provides all the features of NLTK, such as tokenization, POS tagging, dependency trees, named entity recognition, and many more.
Deep learning workflow: spaCy supports deep learning workflows that can connect to models trained on popular frameworks like TensorFlow, Keras, scikit-learn, and PyTorch. This makes spaCy one of the most potent libraries for building and deploying sophisticated language models for real-world applications.
Multi-language support: Provides support for more than 50 languages including French, Spanish, and Greek.
Processing pipeline: Offers an easy-to-use and very intuitive processing pipeline for performing a series of NLP tasks in an organized manner. For example, a pipeline for performing POS tagging, parsing the sentence, and named entity extraction could be defined in a list like this: pipeline = ["tagger", "parser", "ner"]. This makes the code easy to read and quick to debug.
Visualizers: Using displaCy, it becomes easy to draw a dependency tree and entity recognizer. We can add our colors to make the visualization aesthetically pleasing and beautiful. It quickly renders in a Jupyter notebook as well.
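The pipeline concept above can be sketched without downloading any trained model by starting from a blank English pipeline (assumes spaCy 3.x; the "sentencizer" component used here is a rule-based stand-in for the trained components named in the text):

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components yet.
nlp = spacy.blank("en")

# Add a rule-based sentence segmenter as one pipeline component.
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)  # ['sentencizer']

doc = nlp("spaCy pipelines are ordered. Each component processes the Doc.")
print([sent.text for sent in doc.sents])
```

With a trained model such as en_core_web_sm, `nlp.pipe_names` would instead list components like "tagger", "parser", and "ner".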
CoreNLP
Fast and robust: Since it is written in Java, which is a time-tested and robust programming language, CoreNLP is a favorite for many developers.
A broad range of grammatical analysis: Like NLTK and spaCy, CoreNLP also provides a good number of analytical capabilities to process and understand natural language.
API integration: CoreNLP has excellent API support for running it from the command line and programming languages like Python via a third-party API or web service.
Support for multiple operating systems (OSs): CoreNLP works on Windows, Linux, and macOS.
Language support: Like spaCy, CoreNLP provides useful language support, which includes Arabic, Chinese, and many more.
gensim
Topic modeling: It automatically extracts semantic topics from documents. It provides various statistical models, including latent Dirichlet allocation (LDA), for topic modeling.
Pretrained models: It has many pretrained models that provide out-of-the-box capabilities to develop general-purpose functionalities quickly.
Similarity retrieval: gensim’s capability to extract semantic structures from any document makes it an ideal library for similarity queries on numerous topics.
Features available in spaCy, NLTK, and CoreNLP
S.No. | Feature | spaCy | NLTK | CoreNLP |
---|---|---|---|---|
1 | Programming language | Python | Python | Java/Python |
2 | Neural network models | Yes | No | Yes |
3 | Integrated word vectors | Yes | No | No |
4 | Multi-language support | Yes | Yes | Yes |
5 | Tokenization | Yes | Yes | Yes |
6 | Part-of-speech tagging | Yes | Yes | Yes |
7 | Sentence segmentation | Yes | Yes | Yes |
8 | Dependency parsing | Yes | No | Yes |
9 | Entity recognition | Yes | Yes | Yes |
10 | Entity linking | No | No | No |
11 | Coreference resolution | No | No | Yes |
TextBlob
Sentiment analysis: It provides an easy-to-use method for computing polarity and subjectivity scores that measure the sentiment of a given text.
Language translations: Its language translation is powered by Google Translate, which provides support for more than 100 languages.
Spelling corrections: It uses a simple spelling correction method demonstrated by Peter Norvig on his blog at http://norvig.com/spell-correct.html . Norvig, a Director of Research at Google, reports that this approach is about 70% accurate.
fastText
Word embedding learning: Provides word embedding models using skip-gram and Continuous Bag of Words (CBOW), trained in an unsupervised fashion.
Word vectors for out-of-vocabulary words: It provides the capability to obtain word vectors even if the word is not present in the training vocabulary.
Text classification: fastText provides a fast text classifier which, per the authors’ paper “Bag of Tricks for Efficient Text Classification,” is often on par with deep learning classifiers in accuracy while being many orders of magnitude faster to train.
In the next few sections, you will see how to apply these tools to perform various tasks in NLP, NLU, and NLG.
Natural Language Processing
Language skills are considered among the most sophisticated tasks that a human can perform. Natural language processing deals with understanding and manipulating natural language text or speech to perform specific useful desired tasks. NLP combines ideas and concepts from computer science, linguistics, mathematics, artificial intelligence, machine learning, and psychology.
Mining information from unstructured textual data is not as straightforward as performing a database query using SQL. Categorizing documents based on keywords, identifying a mention of a brand in a social media post, and tracking the popularity of a leader on Twitter are all possible if we can identify entities like a person, organization, and other useful information.
The primary tasks in NLP are processing and analyzing written or spoken text by breaking it down, comprehending its meaning, and determining appropriate action. It involves parsing, sentence breaking, stemming, dependency tree, entity extraction, and text categorization.
We will see how words in a language are broken into smaller tokens and how various transformations work (transforming textual data into a structured and numeric value). We will also explore popular libraries like NLTK, TextBlob, spaCy, CoreNLP, and fastText.
Processing Textual Data
We will use the Amazon Fine Food Review dataset throughout this chapter for all demonstrations using various open-source tools. The dataset can be downloaded from www.kaggle.com/snap/amazon-fine-food-reviews , which is made available with a CC0: Public Domain license.
Reading the CSV File
As can be seen, the CSV contains columns like ProductID, UserID, Product Rating, Time, Summary, and Text of the review. The file contains almost 500K reviews for various products. Let’s sample some reviews to process.
Sampling
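The code listings are omitted here; a minimal sketch of reading and sampling the reviews with pandas might look like the following (a few inline toy rows stand in for the actual Kaggle download, and only a subset of the real dataset’s columns is used):

```python
import io
import pandas as pd

# Toy stand-in for Reviews.csv from the Amazon Fine Food Reviews dataset.
csv_data = """Id,ProductId,UserId,Score,Time,Summary,Text
1,B001E4KFG0,A3SGXH7AUHU8GW,5,1303862400,Good Quality Dog Food,I have bought several of the canned dog food products
2,B00813GRG4,A1D87F6ZCVE5NK,1,1346976000,Not as Advertised,Product arrived labeled as jumbo salted peanuts
3,B000LQOCH0,ABXLMWJIXXAIN,4,1219017600,Delight says it all,This is a confection that has been around a few centuries"""

reviews = pd.read_csv(io.StringIO(csv_data))
print(reviews.shape)  # (3, 7)

# Draw a reproducible random sample of reviews to process.
sample = reviews.sample(n=2, random_state=42)
print(sample["Text"].tolist())
```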
Tokenization Using NLTK
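A sketch of tokenizing a review with NLTK. The Treebank tokenizer is used here because it ships with NLTK and needs no extra data download; nltk.word_tokenize works similarly but requires the punkt resource:

```python
from nltk.tokenize import TreebankWordTokenizer

review = "This taffy is so good, soft and chewy."
tokens = TreebankWordTokenizer().tokenize(review)
print(tokens)
```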
Word Search Using Regex
Word Search Using the Exact Word
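The two kinds of search above can be sketched with Python’s built-in re module: a regex pattern that matches any word with a given prefix versus an exact whole-word match using word boundaries:

```python
import re

review = "The chips tasted stale, but the salsa was chip-worthy."

# Regex search: any word starting with "chip"
regex_hits = re.findall(r"\bchip\w*", review, flags=re.IGNORECASE)
print(regex_hits)  # ['chips', 'chip']

# Exact word search: match "salsa" only as a whole word
exact_hit = re.search(r"\bsalsa\b", review)
print(exact_hit.group())  # salsa
```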
NLTK
In this section, we will use many of the features from NLTK for NLP, such as normalization, noun phrase chunking, named entity recognition, and document classifier.
Normalization Using NLTK
In many natural language tasks, we often deal with the root form of words. For example, for the words “baking” and “baked,” the root word is “bake.” This process of extracting the root word is called stemming or normalization. NLTK provides two classes implementing stemming algorithms: the first is the Porter stemmer, and the second is the Lancaster stemmer.
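A quick sketch comparing the two stemmers (both ship with NLTK and need no data downloads):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["baking", "baked", "bakes"]:
    print(word, porter.stem(word), lancaster.stem(word))

# Porter reduces "baking"/"baked" to "bake"; Lancaster is more aggressive
# and may truncate further (e.g., "bak").
```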
Noun Phrase Chunking Using Regular Expressions
Named Entity Recognition
Once we have the POS of the text, we can extract the named entities. Named entities are definite noun phrases that refer to specific individuals such as ORGANIZATION and PERSON. Some other entities are LOCATION, DATE, TIME, MONEY, PERCENT, FACILITY, and GPE. FACILITY is any human-made artifact in the architecture and civil engineering domain, such as the Taj Mahal or the Empire State Building. GPE means geopolitical entities such as cities, states, and countries. We can extract all these entities using the ne_chunk() method in the nltk library.
spaCy
While spaCy offers all the features of NLTK, it is regarded as one of the best production grade tools for an NLP task. In this section, we will see how to use the various methods provided by the spaCy library in Python.
spaCy provides three core English models: en_core_web_sm (10MB), en_core_web_md (91MB), and en_core_web_lg (788MB). The larger models are trained on bigger vocabularies and hence give higher accuracy. So, depending on your use case, choose the model that fits your requirements.
POS Tagging
text: The original text
lemma: Token after stemming, which is the base form of the word
pos: Part of speech
tag: POS with details
dep: The relationship between the tokens. Also called syntactical dependency.
shape: The shape of the word (i.e., capitalization, punctuation, digits)
is_alpha: Returns True if the token consists of alphabetic characters
is_stop: Returns True if the token is a stopword like “at,” “so,” etc.
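The attributes above can be explored even without a trained model. A blank English pipeline fills in the lexical attributes (text, is_alpha, is_stop, shape_); lemma_, pos_, tag_, and dep_ are populated only after loading a trained model such as en_core_web_sm:

```python
import spacy

nlp = spacy.blank("en")  # tokenizer + English defaults, no trained components
doc = nlp("My dog had skin allergies at age 3.")

for token in doc:
    # shape_ maps letters/digits to x/X/d, e.g. "My" -> "Xx", "3" -> "d"
    print(token.text, token.is_alpha, token.is_stop, token.shape_)
```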
Dependency Parsing
text: Original noun chunk
root.text: Original word connecting the noun chunk to the rest of the noun chunk parse
root.dep: Dependency relation connecting the root to its head
root.head: Root token’s head
Dependency Tree
From the dependency trees, you can see that there are two compound word pairs, “English Bulldog” and “skin allergies,” and NUM “3” is the modifier of “age.” You can also see “summer” as the noun phrase as an adverbial modifier (npadvmod) to the token “had.” You can also observe many direct objects (dobj) of a verb phrase, which is a noun phrase, like (got, him) and (had, allergies) and object of a preposition (pobj) like (at, age). A detailed explanation of the relationships in a dependency tree can be found here: https://nlp.stanford.edu/software/dependencies_manual.pdf .
Chunking
Named Entity Recognition
spaCy reports an accuracy of 85.85% on named entity recognition (NER) tasks. The en_core_web_sm model exposes the recognized entities through the ents attribute of a processed document. The model is trained on the OntoNotes dataset, which can be found at https://catalog.ldc.upenn.edu/LDC2013T19 .
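Statistical NER requires a trained model such as en_core_web_sm. As a download-free sketch of the doc.ents API itself, the rule-based EntityRuler component can stand in for the statistical recognizer (the patterns below are illustrative, not part of any shipped model):

```python
import spacy

nlp = spacy.blank("en")
# Rule-based stand-in for the statistical NER, just to show the ents API.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Walmart"},
    {"label": "GPE", "pattern": "London"},
])

doc = nlp("I shopped at Walmart before flying to London.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Walmart', 'ORG'), ('London', 'GPE')]
```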
Types
TYPE | DESCRIPTION |
---|---|
PERSON | Names of people including fictional characters |
NORP | Nationalities or religious or political groups |
FAC | Civil engineering structures or infrastructures like buildings, airports, highways, bridges, etc. |
ORG | Organization names like companies, agencies, institutions, etc. |
GPE | A geopolitical entity like countries, cities, states |
LOC | Non-GPE locations like mountain ranges, water bodies |
PRODUCT | Objects, vehicles, foods, etc. (not services) |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART | Titles of books, songs, etc. |
LAW | Named documents made into laws |
LANGUAGE | Any named language |
DATE | Absolute or relative dates or periods |
TIME | Times smaller than a day |
PERCENT | Percentage, including % |
MONEY | Monetary values, including unit |
QUANTITY | Measurements, as of weight or distance |
ORDINAL | “first,” “second,” etc. |
CARDINAL | Numerals that do not fall under another type |
Pattern-Based Search
In the search span, if we want to find the word “Walmart,” we define this using the matcher.add method and pass pattern as the argument to the method.
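A sketch of this pattern-based search (a blank pipeline suffices, since the Matcher works directly on token attributes; assumes the spaCy 3.x Matcher.add signature, which takes a list of patterns):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One pattern: a single token whose lowercase form is "walmart".
matcher.add("WALMART", [[{"LOWER": "walmart"}]])

doc = nlp("I bought the snacks at Walmart, not at Target.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
# WALMART Walmart
```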
Searching for Entity
Training a Custom NLP Model
CoreNLP
CoreNLP is another popular toolkit for linguistic analysis such as POS tagging, dependency parsing, named entity recognition, sentiment analysis, and many others. We are going to use the CoreNLP features from Python through a third-party wrapper called stanfordcorenlp. It can be installed using pip from the command line or cloned from GitHub here: https://github.com/Lynten/stanford-corenlp .
Once you install or download the code, you need to specify the path to the Stanford-corenlp code from where it picks up the necessary model for the various NLP tasks.
Tokenizing
Part-of-Speech Tagging
POS tags can be extracted using the pos_tag method of the stanfordcorenlp wrapper.
Named Entity Recognition
Constituency Parsing
Constituency parsing extracts a constituency-based parse tree from a given sentence, representing its syntactic structure according to a phrase structure grammar. See Figure 5-13 for a simple example.
Dependency Parsing
TextBlob
TextBlob is a simple library well suited to beginners in NLP. Although it offers a few advanced features, like machine translation, these are delegated to a Google API. It is best for getting to know NLP use cases on generic datasets; for more sophisticated applications, consider using spaCy or CoreNLP.
POS Tags and Noun Phrase
Spelling Correction
Spelling correction is an exciting feature of TextBlob that is not provided by the other libraries described in this chapter. The implementation is based on the simple technique provided by Peter Norvig, which is only about 70% accurate. The correct method in TextBlob provides this implementation.
Machine Translation
Multilingual Text Processing
In this section, we will explore various libraries and their capabilities in handling languages other than English. We find spaCy one of the best in terms of the number of languages it supports, which currently stands at more than 50. We will perform language translation, POS tagging, entity extraction, and dependency parsing on text taken from the popular French news website www.lemonde.fr/ .
TextBlob for Translation
As shown in the example above, we use TextBlob for machine translation so non-French readers can understand the text we process.
POS and Dependency Relations
Its French POS tagging and dependency relation output is quite accurate: it identifies almost all the VERB, NOUN, ADJ, PROPN, and other tags. Next, let’s see how it performs on the entity recognition task.
Named Entity Recognition
Noun Phrases
Natural Language Understanding
Question answering
Natural language search
Web-scale relation extraction
Sentiment analysis
Text summarization
Legal discovery
Relation extraction: Finding the relationship between instances and database tuples. The outputs are discrete values.
Semantic parsing: Parse sentences to create logical forms of text understanding, which humans are good at performing. Again, the output here is a discrete value.
Sentiment analysis: Analyze sentences to give a score in a continuous range of values. A low value means a slightly negative sentiment, and a high score means a positive sentiment.
Vector space model: Create a representation of words as a vector, which then can help in finding similar words and contextual meaning.
We will explore some of the above applications in this section.
Sentiment Analysis
TextBlob provides an easy-to-use implementation of sentiment analysis. The method sentiment takes a sentence as an input and provides polarity and subjectivity as two outputs.
Polarity
A float value within the range [-1.0, 1.0]. The scoring uses a lexicon of positive, negative, and neutral words and detects the presence of each word in one of the three categories. In a simple scheme, a positive word is given a score of 1, a negative word -1, and a neutral word 0. The polarity of a sentence is then defined as the average score, i.e., the sum of the scores of each word divided by the total number of words in the sentence.
If the value is less than 0, the sentiment of the sentence is negative; if it is greater than 0, it is positive; otherwise, it is neutral.
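The averaging scheme described above can be sketched in a few lines of plain Python (a deliberately tiny, hypothetical lexicon for illustration; TextBlob’s actual implementation draws on the much richer pattern lexicon):

```python
# Toy polarity lexicon: +1 positive, -1 negative; absent words score 0.
LEXICON = {"good": 1, "great": 1, "tasty": 1, "bad": -1, "stale": -1}

def polarity(sentence: str) -> float:
    """Average per-word score: sum of word scores / number of words."""
    words = sentence.lower().split()
    return sum(LEXICON.get(w, 0) for w in words) / len(words)

print(polarity("the chips were good but a bit stale"))  # (1 - 1) / 8 = 0.0
print(polarity("great tasty chips"))                    # 2 / 3, i.e. positive
```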
Subjectivity
A float value within the range [0.0, 1.0]. A score of 1 means “very subjective.” Unlike polarity, which reveals the sentiment of the sentence, subjectivity does not express any sentiment; the score tends toward 1 when the sentence contains personal views or beliefs. The score for the entire sentence is calculated by assigning each word a subjectivity score and averaging, the same way as for polarity.
The TextBlob library internally calls the pattern library to calculate the polarity and subjectivity of a sentence. The pattern library uses SentiWordNet, which is a lexical resource for opinion mining, with polarity and subjectivity scores for all WordNet synsets. Here is the link to the SentiWordNet: https://github.com/aesuli/sentiwordnet .
Language Models
The first task of any NLP modeling is to break a given piece of text into tokens (or words), the fundamental unit of a sentence in any language. Once we have the words, we want to find the best numeric representation of the words because machines do not understand words; they need numeric values to perform computation. We will discuss two: Word2Vec (Word to a Vector) and GloVe (Global Vectors for Word Representation). For Word2Vec, a detailed explanation is provided in the next section.
Word2Vec
A skip-gram neural network model for Word2Vec computes the probability for every word in the vocabulary of being the “nearby word” that we select. Proximity or nearness of words can be defined by a parameter called window size. Figure 5-14 shows the possible pair of words for training a neural network with window size of 2.
The input sentence, “Building an enterprise chatbot that can converse like humans,” is broken into words, and with a window size of 2, we take up to two words each from the left and right of the input word. So, if the input word is “chatbot,” the output probability of the word “enterprise” will be high because of its proximity to “chatbot” within the window of size 2. This is only one example sentence. In a given corpus, we will have thousands of such sentences; the neural network will learn statistics from the number of times each pairing shows up. So, if we feed many more training samples like the one shown in Figure 5-14, it will figure out how likely the words “chatbot” and “enterprise” are to appear together.
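The windowing described above can be sketched directly (a hypothetical helper mirroring Figure 5-14’s window size of 2):

```python
def skipgram_pairs(tokens, window=2):
    """(input word, context word) pairs for every word in a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "building an enterprise chatbot that can converse like humans".split()
pairs = skipgram_pairs(tokens, window=2)
print(("chatbot", "enterprise") in pairs)  # True
```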
Neural Network Architecture
The input vector to the neural network is a one-hot vector representing the input word “chatbot”: it stores 1 in the ith position of the vector and 0 in all other positions, where 1 ≤ i ≤ n and n is the size of the vocabulary (the set of all unique words)
In the hidden layer, each one-hot vector of size n is multiplied by a weight matrix whose rows are feature vectors of size, let’s say, 1000. When training starts, these feature vectors are assigned random values. The multiplication selects the row of the n × 1000 matrix corresponding to the position where the one-hot vector has a value of 1.
So, if the vector representing “chatbot” is multiplied with the output vector representing “enterprise,” the softmax output will be close to 1, because in our corpus the two words appear together very frequently.
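The hidden-layer multiplication is effectively a row lookup, which a tiny numpy sketch makes concrete (a vocabulary of 5 and feature size 4 stand in for the n words and 1000 features in the text):

```python
import numpy as np

vocab = ["building", "an", "enterprise", "chatbot", "humans"]
n, dim = len(vocab), 4

rng = np.random.default_rng(0)
W = rng.random((n, dim))           # randomly initialized feature vectors

i = vocab.index("chatbot")
one_hot = np.zeros(n)
one_hot[i] = 1.0

hidden = one_hot @ W               # multiplying selects row i of W
print(np.allclose(hidden, W[i]))   # True
```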
Using the Word2Vec Pretrained Model
In the following code, we use a pretrained Word2Vec model from the popular Python library gensim. Word2Vec models provide a vector representation of words that makes various natural language tasks possible, such as identifying similar words, finding synonyms, word arithmetic, and many more. Popular word-embedding approaches include Word2Vec’s CBOW and skip-gram architectures as well as GloVe. In this section, we will use these models to perform various NLU tasks.
In the demo, we use the model to perform many syntactic/semantic NLU word tasks.
review_texts: Input vocabulary to the neural network (NN).
size: The size of the NN layers, which corresponds to the degrees of freedom the algorithm has. A bigger network is usually more accurate, provided there is a sizeable dataset to train on. The suggested range is from tens to thousands.
min_count: This argument helps in pruning the less essential words from the vocabulary, such as words that appeared once or twice in the corpus of millions of words.
workers: The number of worker threads Word2Vec uses for training parallelization, which speeds up the training process considerably. As per the official gensim docs, you need to install Cython in order to run in parallelized mode.
Note
After installing Cython, you can run the following code to check if you have the FAST_VERSION of word2vec installed.
Performing Out-of-the-Box Tasks Using a Pretrained Model
One of the useful features of gensim is that it offers several pretrained word vectors from gensim-data. Apart from Word2Vec, it also provides GloVe, another robust unsupervised learning algorithm for finding word vectors. The following code downloads a glove-wiki-gigaword-100 word vector from gensim-data and performs some out-of-the-box tasks.
Step 2: Compute the nearest neighbors. As you have seen, a word vector is an array of numbers representing a word. It is therefore possible to perform mathematical computations on the vectors; for example, we can compute the Euclidean distance or cosine similarity between any two word vectors. This yields some interesting results. The following code shows some of the outcomes.
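As a self-contained sketch of the two similarity computations themselves (toy numpy vectors stand in for the pretrained GloVe vectors):

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    return float(np.linalg.norm(u - v))

apple = np.array([0.8, 0.1, 0.3])   # toy stand-ins for real word vectors
orange = np.array([0.7, 0.2, 0.3])

print(cosine_similarity(apple, orange))  # close to 1: similar directions
print(euclidean_distance(apple, orange))
```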
Figure 5-14 shows an example of how the input data for training the neural network was created by shifting a window of size 2. In the following example, you will see that “apple” on the Internet is no longer a fruit; it has become synonymous with Apple the company, and many similar companies show up when we compute words similar to “apple.” The reason for this similarity is the vocabulary used for training, which in this case is a Wikipedia dump of close to 6 billion uncased tokens. More such pretrained models are available at https://github.com/RaRe-Technologies/gensim-data .
Step 3: Identify linear substructures. The relatedness of two words is easy to compute using a similarity or distance measure, whereas capturing the nuances of a word pair or of sentences in a more qualitative way requires vector operations. Let’s see the methods that the gensim package offers to accomplish this task.
Word Pair Similarity
Sentence Similarity
We can also find the distance or similarity between two sentences. gensim offers a measure called Word Mover’s Distance, which has proved to be quite a useful tool for finding the similarity between two documents containing many sentences. The lower the distance, the more similar the two documents. Under the hood, Word Mover’s Distance uses the word embeddings generated by the Word2Vec model to first capture the concept of the query sentence (or document) and then find all similar sentences or documents. For example, the Word Mover’s distance between two unrelated sentences is high compared to that between two sentences that are contextually related.
Arithmetic Operations
Even more impressive is the ability to perform arithmetic operations like addition and subtraction on word vectors to obtain a form of linear substructure. In the first example, we compute woman + king − man, and the word most similar to the result is queen. The underlying concept is that man and woman are genders, which may be equivalently specified by other word pairs such as king and queen. Hence, when we subtract man from the sum of woman and king, the word we obtain is queen. The GloVe project provides a few examples here: https://nlp.stanford.edu/projects/glove/ .
Odd Word Out
The model can find words that are out of context in a given sequence of words. The doesnt_match method computes the center point as the mean of all the word vectors in a given list and finds each word’s cosine distance from that center. The word with the highest cosine distance is returned as the odd word that does not fit in the list.
Language models like Word2Vec and GloVe are compelling in generating meaningful relationships between words, which comes naturally to a human because of our understanding of languages. It is an excellent accomplishment for machines to be able to perform at this level of intelligence in understanding the use of words in various syntactic and semantic forms.
fastText Word Representation Model
Similar to the examples discussed in this section, using either the skip-gram or CBOW model, various tasks can be performed. We can evaluate the performance to choose the best model for our final implementation.
Information Extraction Using OpenIE
The Open Information Extractor (OpenIE) annotator extracts open-domain relation triples representing subject, predicate, and object, often called a triplet. OpenIE can be a useful tool when there is minimal training data available.
The Possible Triplets from the Example Sentence Using OpenIE
S.No | Subject | Predicate | Object |
---|---|---|---|
1 | Narendra Modi | is | politician serving as 14th Prime Minister |
2 | Narendra Modi | is | Indian politician serving as 14th Prime Minister |
3 | Narendra Modi | is | politician serving as Prime Minister |
4 | Narendra Modi | is | Politician |
5 | Modi | is | Indian |
6 | Narendra Modi | is | Indian politician serving as 14th Prime Minister of India |
7 | Narendra Modi | is | Indian politician serving as Prime Minister |
8 | Narendra Modi | is | Indian politician serving as Prime Minister of India since 2014 |
9 | Narendra Modi | is | Indian politician serving as Prime Minister since 2014 |
10 | Narendra Modi | is | politician serving as Prime Minister of India since 2014 |
11 | Narendra Modi | is | politician serving as 14th Prime Minister of India since 2014 |
12 | Narendra Modi | is | politician serving as 14th Prime Minister since 2014 |
13 | Narendra Modi | is | Indian politician serving as 14th Prime Minister since 2014 |
14 | Narendra Modi | is | politician serving as Prime Minister of India |
15 | Narendra Modi | is | politician serving as Prime Minister since 2014 |
16 | Narendra Modi | is | Indian politician |
17 | Narendra Modi | is | Indian politician serving as Prime Minister of India |
18 | Narendra Modi | is | Indian politician serving since 2014 |
19 | Narendra Modi | is | politician serving as 14th Prime Minister of India |
20 | Narendra Modi | is | politician serving since 2014 |
21 | Narendra Modi | is | Indian politician serving as 14th Prime Minister of India since 2014 |
Topic Modeling Using Latent Dirichlet Allocation
Topic modeling is one of the typical applications of understanding natural language. Given a collection of documents, we can draw an “abstract topic” that represents all the docs in the collection. Latent Dirichlet allocation (LDA) is a favorite statistical model used for topic modeling. It helps in discovering the semantic structures in a given text.
In this section, for a demonstration, we will use three example reviews from the Amazon Fine Food review dataset to train an LDA model. We will see one other example of topic modeling using additional tools like spaCy, NLTK, and gensim in the “Applications” sections.
Collection of Documents
Loading Libraries and Defining Stopwords
Removing Common Words and Tokenizing
Removing Words That Appear Infrequently
Now we see the words that occur more than once. For our example, it seems like there are not many words with more than one occurrence. We expect the model not to perform very well. However, let’s still go ahead with training the model.
Saving the Training Data as a Dictionary
Generating the Bag of Words
Training the Model Using LDA
Finally, using the bag-of-words dictionary of words, we train the latent Dirichlet allocation model. LDA is a generative statistical model: given observed variables X (the words) and latent variables Y (the topics), it models the joint probability P(X, Y). LDA is a favorite machine learning model widely used in topic modeling. Each document (in our example, each review) is modeled as a mixture of various topics, and LDA assigns a set of topics to each document.
The gensim library provides a method called LsiModel(), which we use here for training. Strictly speaking, LsiModel() trains a latent semantic indexing (LSI) model, a technique closely related to LDA that originated in information retrieval; gensim's LdaModel() trains LDA proper.
- 1.
Topic 1: 0.556*"it" + 0.542*"tasted" + 0.428*"as" + 0.328*"chip" + 0.328*"corn" + 0.000*"i"
- 2.
Topic 2: -0.804*"tasted" + 0.528*"it" + 0.190*"corn" + 0.190*"chip" + 0.041*"as" + 0.000*"i"
Note that a more accurate model would need plenty of training data, and with it, many more interesting topics might emerge.
Natural Language Generation
Natural language generation is a subfield of NLP and computational linguistics concerned with producing understandable, human-readable text in various languages. The ability to use language representations and domain knowledge to produce documents, explanations, help messages, reports, and even poems makes NLG one of the most actively researched areas today.1 In the future, NLG will play a vital role in human-computer interfaces.
The significant difference between NLU and NLG is that NLU maps sentences into internal semantic representations (called parsing in NLU systems), whereas NLG maps semantic representations into surface sentences (called realization in NLG systems). Both types of mapping can be achieved through a bidirectional grammar, which uses a declarative representation of a language's grammar.
We will demonstrate NLG applications using Python- and Java-based libraries like markovify and SimpleNLG. We will also use a deep learning model for text generation. Such deep learning models are behind the popular use cases where machines write poems or generate musical notes, given a sizeable corpus of data.
Automating the documentation of code and procedures
Generating reports from financial data or annual reports
Summarizing graphical reports and numbers from tabular data
Generating discharge summaries and pathology reports
Helping meteorologists compose weather forecasts
There are many more use cases that are evolving quickly, especially with the emerging sophistication of deep learning algorithms and increasing computation power of machines.
Markov Chain-Based Headline Generator
A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Markov chains statistically model random processes: they are defined by states and transition probabilities, where a process moves from one state to another based on preset probability values.
Markov chain models are mathematically robust and give superior results if modeled correctly. Unlike many machine learning algorithms that can be applied in a brute-force fashion, Markov chains need a diligent design to model a stochastic process well.
Computer simulation of numerous real-world phenomena such as weather modeling, stock market fluctuations, and water flow in a dam
Biological modeling like population processes
Algorithmic music composition
Modeling board games like Snakes and Ladders or Hi Ho! Cherry-O
Population genetics to describe changes in gene frequencies in small populations affected by genetic drift
Let’s use the markovify library from Python to generate some headlines.
Loading the Library
Loading the File and Printing the Headlines
Building a Text Model Using Markovify
Generating Random Headlines
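The markovify steps above boil down to building a word-level transition table and sampling from it. A minimal stdlib sketch of the same idea (the toy headlines are invented for illustration):

```python
import random

# Invented headlines standing in for the real training file
headlines = [
    "stocks rally as markets open higher",
    "stocks fall as investors worry",
    "markets open lower on trade fears",
]

# Transition table: word -> list of observed next words (duplicates keep counts)
transitions = {}
for line in headlines:
    words = ["<s>"] + line.split() + ["</s>"]
    for cur, nxt in zip(words, words[1:]):
        transitions.setdefault(cur, []).append(nxt)

def generate(seed=None):
    """Walk the chain from <s> until </s>, choosing each next word at random."""
    rng = random.Random(seed)
    word, out = "<s>", []
    while True:
        word = rng.choice(transitions[word])
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(generate(0))
```

markovify does the same with longer state sizes (bigrams by default) and extra safeguards against reproducing the training sentences verbatim.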
SimpleNLG
Orthography: This refers to the conventions for writing a language. It includes capitalization, whitespace within sentences and paragraphs, punctuation, emphasis, and hyphenation.
Morphology: The study of words, their formation, and relationship with other words in the same language. It analyzes the structure of words and parts of words, such as stems, root words, prefixes, and suffixes.
Simple grammar: Ensures grammatical correctness like noun-verb agreement and creating well-formed verb groups (e.g. “does not play”).
In NLG terminology, SimpleNLG is a realizer for a simple grammar. It can be useful for creating documentation and reports that need grammatically correct sentences. The demonstration in this section uses nglib, a Python library that is mainly a wrapper around SimpleNLG.
Loading the Library
Tense
Negation
Interrogative
Complements
Modifiers
Prepositional Phrases
Coordinated Clauses
Subordinate Clauses
Main Method
Printing the Output
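For intuition about what a realizer does, here is a toy, hand-rolled sketch of tense, negation, and interrogative realization. This is not the SimpleNLG or nglib API; real realizers use proper morphology tables rather than these naive string rules:

```python
# Toy realizer (NOT the SimpleNLG/nglib API): naive string rules standing in
# for SimpleNLG's morphology and grammar machinery
def realize(subject, verb, obj, tense="present", negated=False, interrogative=False):
    if interrogative:
        aux = "does" if tense == "present" else "did"
        return f"{aux} {subject} {verb} {obj}?".capitalize()
    if negated:
        vp = ("does not " if tense == "present" else "did not ") + verb
    elif tense == "past":
        vp = verb + "ed"          # naive inflection; real verbs are often irregular
    else:
        vp = verb + "s"           # naive 3rd-person singular
    return f"{subject} {vp} {obj}.".capitalize()

print(realize("mary", "chase", "the monkey"))
print(realize("mary", "chase", "the monkey", negated=True))
print(realize("mary", "chase", "the monkey", interrogative=True))
```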
As you can see, SimpleNLG offers an easy-to-use syntax for programmatically generating grammatically correct sentences in English. Next, let's dive into a deep learning model that generates the next words given a piece of text. Unlike with SimpleNLG, we are not sure that such a deep learning model will produce grammatically correct sentences.
Deep Learning Model for Text Generation
Text generation using deep learning underlies language models and applications like speech-to-text, conversational chatbots, and text summarization. Such language models predict the occurrence of a word based on the previous sequence of words. Many deep learning network architectures, such as recurrent neural networks (RNNs), are available for language modeling.
RNNs are deployed in a variety of applications like speech recognition, language modeling, translation, image captioning, and many more. Figure 5-17 shows how the hidden layers in RNNs are stacked up in a chain-like sequence. The rolled and unrolled versions help in understanding how the internal processing happens.
Although RNNs are capable of picking up such long-term dependencies in sentences, they require a careful selection of parameters, which is often difficult in many practical problems. This is where LSTMs come to the rescue.
Cell state: The line that runs through the top of the cell with only a few direct interactions, such as pointwise multiplication and addition, which can add information to or remove information from the cell state.
Forget gate layer: Gates are a mechanism by which LSTMs control how much information should be passed through the cell state. Here a sigmoid function is used, which has an output value between 0 and 1. A value of 1 means let everything pass; 0 means let nothing pass.
Input gate layer: The sigmoid layer called an input gate layer decides which values we will update.
Tanh layer: The tanh activation function layer creates a vector of new candidate values given the input and hidden state values from the previous time step.
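The four components above can be summarized as one LSTM step. A scalar, stdlib-only sketch (the shared toy weights w, u, and b are an illustration; real LSTMs learn separate weight matrices for each gate):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=0.5, u=0.5, b=0.0):
    """One scalar LSTM step; real LSTMs use separate learned weights per gate."""
    z = w * x + u * h_prev + b
    f = sigmoid(z)                # forget gate: 0 = drop old state, 1 = keep it
    i = sigmoid(z)                # input gate: how much new candidate to admit
    c_tilde = math.tanh(z)        # tanh layer: candidate cell-state values
    c = f * c_prev + i * c_tilde  # updated cell state
    o = sigmoid(z)                # output gate
    h = o * math.tanh(c)          # new hidden state
    return h, c

h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0)
print(h, c)
```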
Loading the Library
Defining the Training Data
Data Preparation
- 1.
Convert the input review text into lowercase and split the review into sentences at the newline character, \n. The split function created three sentences in the corpus. The following is the result of the operation:
- 2.
Tokenize the input reviews from the dataset using the Keras fit_on_texts() method. The method internally represents the words in a dictionary, with each word getting an index based on the frequency of its occurrence. So, if the word "the" appears most often in our review text, it gets the lowest index value, i.e., word_index["the"] = 1 (Keras reserves index 0 for padding). In our review, except for the words "the" and "to," all other words appear just once. The following is the output of the operation:
- 3.
Transform each word in the review into a sequence of integers. Each word gets the integer value corresponding to the index obtained using fit_on_texts(). The following is the output of the operation:
- 4.
Generate n-gram sequences using the integer sequence for each sentence in the corpus. In each iteration of the for loop, the list input_review_sequences gets updated. In the final output, all possible n-gram prefixes of length 2 to len(token_list) are generated.
- 5.
Pad the sequences. Since the n-gram sequences differ in length, the matrix computation in the neural network would not be possible, so each sequence is left-padded with 0 to make them equal in length. For example, the first sequence in the list, [3, 4], is padded as [0 0 0 0 0 0 0 0 0 0 0 0 3 4]. The following is a view of the matrix after padding:
- 6.
Set the last word as the label for each n-gram sequence. For example, in the n-gram sequence [3, 4] corresponding to the words ["chilling," "in"], the label is "in." Similarly, in the n-gram sequence [3, 4, 1] corresponding to the words ["chilling," "in," "the"], the label is "the." Since the model predicts the next possible word as part of the text generation process, the predictor-and-label pairs help the neural network learn which word is more likely to occur after a given sequence of words. The following code extracts the label for each n-gram sequence in the matrix above; observe that each label is the last integer in each row:
predictors, label = input_review_sequences[:,:-1], input_review_sequences[:,-1]
print(label)
[ 4 1 5 6 2 7 1 8 9 10 12 13 14 15 16 17 2 18 19 20 21 22 23 25 26 27 28 29 30 31 1 32 33 34]
- 7.
As a final step in the preprocessing, we convert each label into a one-hot encoded vector to make it feasible for matrix computation in the neural network training. to_categorical() is a method from the keras.utils library. Here is the output:
label = ku.to_categorical(label, num_classes=total_words)
print(label)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 1.]]
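Steps 4 through 6 can be sketched with plain Python lists (the integer-encoded corpus below is a toy stand-in for the tokenized reviews):

```python
# Toy integer-encoded sentences standing in for the tokenized reviews
corpus = [[3, 4, 1, 5], [6, 2, 7]]

# Step 4: all n-gram prefixes of length 2..len(sentence)
input_review_sequences = []
for token_list in corpus:
    for i in range(2, len(token_list) + 1):
        input_review_sequences.append(token_list[:i])

# Step 5: left-pad with 0 so every sequence has equal length
max_len = max(len(s) for s in input_review_sequences)
padded = [[0] * (max_len - len(s)) + s for s in input_review_sequences]

# Step 6: the last integer of each row is the label, the rest are predictors
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]
print(padded)
print(labels)
```

In the real pipeline, Keras's pad_sequences() performs the padding and NumPy slicing splits predictors from labels, but the shapes produced are exactly these.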
Creating an RNN Architecture Using an LSTM Network
- 1.
Embedding: A dense vector representation for each word index. The fixed integers of the predictor are converted into randomly initialized dense vectors; for example, [3, 4] could become [[0.26, 0.14], [0.2, -0.4]]. The dimension of the dense vector is given by the second argument, output_dim, to the Embedding layer in Keras. The first argument, input_dim, is the total number of words in the reviews. The argument input_length is set to the maximum sequence length minus 1.
- 2.
LSTM: The long short-term memory layer takes units as the dimensionality of its output space. The activation function is tanh by default, and the recurrent activation function is a hard sigmoid by default. Other available activation functions include softmax and the Rectified Linear Unit (ReLU), but with LSTMs it is recommended to stick with tanh and sigmoid.
- 3.
Dropout: RNNs have a tendency to overfit the data. The Dropout layer in Keras randomly sets a fraction of the input units to 0 based on the value of the rate argument. In the example, the rate is set to 0.1, which means 10% of the input units are randomly dropped.
- 4.
Dense: The Dense layer creates a regular, densely connected neural network layer. It serves as the output layer, where a softmax activation function is applied to give values between 0 and 1. The word with a value close to 1 is the most probable next word in the sequence given the input predictor.
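Putting the four layers together, a hedged sketch of the described architecture using the Keras Sequential API (the vocabulary size and layer dimensions are toy values, not the book's exact configuration; input_length is omitted here since recent Keras versions infer sequence length from the data):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

total_words = 35   # toy vocabulary size for illustration

model = Sequential([
    Embedding(input_dim=total_words, output_dim=10),  # dense word vectors
    LSTM(units=100),                                  # tanh/sigmoid by default
    Dropout(rate=0.1),                                # drop 10% of units
    Dense(total_words, activation="softmax"),         # distribution over next word
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
print(len(model.layers))
```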
Defining the Generate Text Method
Training the RNN Model
Generating Text
Applications
Topic modeling using the spaCy, NLTK, and gensim libraries: This is an extension of the topic modeling we performed using LDA earlier in the chapter. In this demonstration, we will use the combined knowledge of spaCy, NLTK, and gensim to perform various tasks in topic modeling.
Classify gender (male or female) from a person's name: Using features like the last letter of a name and a corpus of male and female names, we will classify names as male or female. This might help in filtering through the reviews and identifying any gender-based distinctions in the reviews for a product.
Classify a given document into a category: We classify a review as positive or negative. We will use the NLTK library to perform the preprocessing and a Naïve Bayes classifier for the classification.
Intent classification and question answering: In this application, we will build an intent classifier and a context-based question-answering utility that could be integrated into any chatbot application. We will use pretrained deep learning models from the DeepPavlov library in Python.
Topic Modeling Using spaCy, NLTK, and gensim Libraries
In this demonstration, we will use spaCy for tokenizing the review text, NLTK for lemmatization and preprocessing, and the LDA model from gensim for training.
Tokenizing and Cleaning the Text
- 1.
Detect URLs and screen names, and append them separately into the lda_review_tokens list. This ensures the URLs and screen names are not processed further.
- 2.
Convert the rest of the tokens into lowercase.
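A sketch of these two cleaning steps using the standard library's re module (the function body and sample text are illustrative; lda_review_tokens is the list named in the text above):

```python
import re

def clean_tokens(text):
    """Keep URLs and @screen-names aside untouched; lowercase everything else."""
    lda_review_tokens = []
    for tok in text.split():
        if re.match(r"https?://\S+", tok) or tok.startswith("@"):
            lda_review_tokens.append(tok)      # no further processing
        else:
            lda_review_tokens.append(tok.lower())
    return lda_review_tokens

print(clean_tokens("Loved it @FoodCo see https://example.com REALLY tasty"))
```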
Lemmatization
Preprocessing the Text Method for LDA
- 1.
Remove all the stopwords in the English vocabulary. We need to download the dataset named stopwords before we can check the tokens against it.
- 2.
Extract the lemma for each token after removing the stopwords.
Reading the Training Data
Bag of Words
Training and Saving the Model
From the output above, it looks like topics 0 and 3 are about a "ginger flavor corn syrup," while topics 2 and 4 are not very clear about what they convey. Topic 1 talks about "tortilla chips."
Predictions
Gender Identification
In this application, we use a corpus of male and female names to build a model for predicting gender from a given name. It is a simple model whose only feature is the last letter of the name. The core idea is that female and male names generally show certain distinctive features; for example, most female names end with a, e, or i. We use the NLTK library to build this model.
Loading the NLTK Library and Downloading the Names Corpus
Loading the Male and Female Names
Common Names
Extract Features
Randomly Splitting into Train and Test
Training the Model
Model Prediction
Model Accuracy
Most Informative Features
Using the show_most_informative_features() method of the model, we can see which last letters of the names are most informative for classifying male and female names.
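The whole pipeline can be miniaturized in stdlib Python. The toy name lists below stand in for the NLTK names corpus, and a simple count-based rule stands in for NLTK's Naive Bayes classifier:

```python
from collections import Counter, defaultdict

# Toy name lists standing in for the NLTK names corpus
male = ["john", "mike", "robert", "david", "mark"]
female = ["mary", "linda", "susan", "julia", "anna"]

def gender_features(name):
    return name[-1]                  # the single feature: last letter

# Count last-letter occurrences per class (stand-in for Naive Bayes training)
counts = defaultdict(Counter)
for name in male:
    counts[gender_features(name)]["male"] += 1
for name in female:
    counts[gender_features(name)]["female"] += 1

def classify(name):
    c = counts[gender_features(name)]
    return "female" if c["female"] >= c["male"] else "male"

print(classify("amanda"))
print(classify("mark"))
```

With the full NLTK corpus and a proper Naive Bayes model, the counts become smoothed probabilities, but the feature and the intuition are the same.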
Document Classification
A common task in NLP is tagging a document (which could also be a collection of sentences) with a specific category. An example is a news aggregator classifying articles into political, sports, and business. Such classification is useful when there is an enormous amount of unstructured textual data and no manual labor available to tag it. An automatic document classifier can fast-track the tagging process. Another domain where it's useful is classifying movie and product reviews into positive and negative sentiment.
Loading Libraries
Reading the Dataset into the Categorized Corpus
Computing Word Frequency
Checking the Presence of Frequent Words
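A stdlib sketch of this feature-extraction idea: the most frequent corpus words become boolean contains(word) features, as in NLTK's document classification recipes. The labeled reviews below are invented for illustration:

```python
from collections import Counter

# Invented labeled reviews standing in for the categorized corpus
reviews = [
    ("great product and good taste", "pos"),
    ("good flavor would buy again", "pos"),
    ("not good at all very stale", "neg"),
    ("did not like the taste", "neg"),
]

# The most frequent corpus words become the feature set
all_words = Counter(w for text, _ in reviews for w in text.split())
word_features = [w for w, _ in all_words.most_common(5)]

def document_features(text):
    """Boolean presence/absence of each frequent word."""
    words = set(text.split())
    return {f"contains({w})": (w in words) for w in word_features}

print(document_features("not a good product"))
```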
Training the Model
Most Informative Features
In this corpus, a review that mentions "not" is almost five times more likely to be negative than positive, while a review that mentions "good" is only about three times more likely to be negative than positive. Perhaps the negative skew of the word "good" stems from reviews of the nature "the product is good but ...," where customers have one or two complaints.
If we add more positive and negative reviews to this corpus, the accuracy will start to improve.
Intent Classification and Question Answering
The two most important NLU tasks a chatbot should perform well are classifying the intent of a given user query and answering questions by understanding the context. While there are many proprietary frameworks for these two tasks, they don't provide visibility into what happens behind the scenes. In this section, we will use a Python library called deeppavlov. It's an open-source deep learning library for end-to-end dialog systems and chatbots, and it provides many pretrained deep learning models as part of its offering.
Intent Classification
We need to classify a given query (input from the user) into an intent class. Once an intent class is identified, a chatbot can trigger the respective logic as a response to a user query. For example, if the query is “how is the weather today,” the intent classification should trigger the weather services API from within the chatbot and fetch the result.
GetWeather
BookRestaurant
PlayMusic
AddToPlaylist
RateBook
SearchScreeningEvent
SearchCreativeWork
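Once an intent label comes back from the classifier, the chatbot dispatches to the matching service, as the GetWeather example above describes. A toy sketch of that dispatch (the handler names and return strings are hypothetical, not part of any real API):

```python
# Hypothetical handlers; in a real chatbot these would call actual services
def get_weather(query):
    return "weather-service called"

def book_restaurant(query):
    return "booking-service called"

HANDLERS = {
    "GetWeather": get_weather,
    "BookRestaurant": book_restaurant,
}

def respond(intent, query):
    """Route the classified intent to its handler, with a safe fallback."""
    handler = HANDLERS.get(intent)
    return handler(query) if handler else "fallback response"

print(respond("GetWeather", "how is the weather today"))
```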
Setting tensorflow as the Back End
Building the Model
Classifying the Intent
You can train a custom model to classify the intent for a specific use case. More details on training a custom model can be found at http://docs.deeppavlov.ai/en/latest/components/classifiers.html#how-to-train-on-other-datasets. Training a custom model is a resource-intensive process, so if you are trying to build a generic chatbot, we suggest you first explore all the pretrained models listed at http://docs.deeppavlov.ai/en/latest/components/classifiers.html#pre-trained-models before deciding to build your own.
Question Answering
Chatbots often need to understand the context of the conversation to answer a particular query from a user. The deeppavlov library provides a model pretrained on the Stanford Question Answering Dataset (SQuAD), a reading comprehension dataset consisting of crowdsourced questions on a set of Wikipedia articles. More details on the dataset can be found at https://rajpurkar.github.io/SQuAD-explorer/.
The main task of the model trained on the SQuAD dataset is to take a given context and answer a question within that context.
Building the Model
Context and Question
Serving the DeepPavlov Model
By the end of the above command, we should see the following output, where a Flask app is created and the API is running on the local host. You can specify your own port and URL for hosting the API. More on this can be found at http://docs.deeppavlov.ai/en/latest/devguides/rest_api.html.
In the next chapter, we will introduce our enterprise chatbot named IRIS, from which we can directly call the above REST API for intent classification. Note that you still have to train your own model on private enterprise data in order to integrate it with the chatbot. Even though we will build IRIS using a Java framework, the REST API we created above can easily be called from within a Java application. We can create many applications using the powerful Python libraries for NLP, NLU, and NLG tasks and simply host all of them behind a REST API, which is language and platform agnostic.
Summary
We started by identifying the differences between natural language processing, understanding, and generation, and then discussed various open source tools available to process and understand natural languages.
Then we delved into NLP, where we showed how to use tools like NLTK, spaCy, CoreNLP, gensim, and TextBlob for various tasks such as processing textual data, normalizing text, part-of-speech tagging, dependency parsing, spelling correction, machine translation, and named entity recognition.
In the NLU section, we showed language models like Word2Vec and GloVe for performing out-of-the-box tasks such as word and sentence similarity, finding linear substructures between words, and performing arithmetic operations on word embedding vectors to find meaningful semantic relationships between words. As an important part of NLU, we explored relationship extraction from a given sentence using the OpenIE tool and built a topic modeling tool using latent Dirichlet allocation (LDA).
We then moved into NLG, where we explored use cases like a random headline generator using the markovify library in Python. Then we explored SimpleNLG, an English grammar-based natural language generation utility, which offers grammatical operations such as generating the past tense, negation, complements, and prepositional phrases. Finally, in the NLG section we built a deep learning-based model for predicting the next word in a given phrase or sentence, using a popular deep learning architecture called long short-term memory.
In the final part, we covered applications of NLP and NLU: topic modeling, gender identification, document classification, intent classification, and question answering. In the topic modeling application, we utilized all of the open source tools available from the previous sections of the chapter.
Overall, in this chapter we explored the P-U-G of natural languages extensively. The availability of many open source tools in Python and Java facilitated a great number of demonstrations for understanding and modeling natural languages. We covered a wide range of topics, from parsing text data to building generative models using deep learning. Our aim with this chapter was to provide an exhaustive collection of methods and tools to empower you to build chatbots with basic and advanced levels of natural language processing, understanding, and generation capabilities.
Next, we will build and deploy a fully functional in-house enterprise chatbot on private datasets. Since there are many chatbot frameworks with support for NLP and NLU, the methods discussed in this chapter might at first seem not so readily usable; however, under the hood, many frameworks like RASA and LUIS internally use the techniques discussed in this chapter. Also, many ideas from NLG are still not available in any standard chatbot framework, so they are often built from scratch. We believe the ideas taught in this chapter will come in handy when you build an enterprise chatbot.