Topic modeling 

When we have a collection of documents whose categories are not known in advance, topic models help us discover an approximate categorization. A topic model treats each document as a mixture of topics, typically with one dominant topic.

For example, let's suppose we have the following sentences:

  • Eating fruits as snacks is a healthy habit
  • Exercising regularly is an important part of a healthy lifestyle
  • Grapefruit and oranges are citrus fruits

A topic model of these sentences may output the following:

  • Topic A: 40% healthy, 20% fruits, 10% snacks
  • Topic B: 20% Grapefruit, 20% oranges, 10% citrus
  • Sentences 1 and 2: 80% Topic A, 20% Topic B
  • Sentence 3: 100% Topic B

From the output of the model, we can guess that Topic A is about health and Topic B is about fruits. Although these topics are not known a priori, the model outputs the probabilities with which words associated with health, exercising, and fruits occur in each topic.
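To make this mixture view concrete, the following is a minimal sketch of the generative assumption behind topic models, using the made-up numbers from the toy example above (the distributions and the sample_word helper are purely illustrative and are not produced by any fitted model):

import random

# Illustrative topic-word distributions (Topic A ~ health, Topic B ~ fruits)
topics = {
    'A': {'healthy': 0.4, 'fruits': 0.2, 'snacks': 0.1, 'exercising': 0.3},
    'B': {'Grapefruit': 0.2, 'oranges': 0.2, 'citrus': 0.1, 'fruits': 0.5},
}

# A document is modeled as a mixture over topics,
# for example 80% Topic A and 20% Topic B
doc_mixture = {'A': 0.8, 'B': 0.2}

def sample_word():
    # First pick a topic according to the document's mixture,
    # then pick a word from that topic's word distribution
    topic = random.choices(list(doc_mixture), weights=list(doc_mixture.values()))[0]
    words = topics[topic]
    return random.choices(list(words), weights=list(words.values()))[0]

print([sample_word() for _ in range(5)])

Fitting a topic model is the inverse of this sampling process: given only the documents, it recovers plausible topic-word distributions and per-document mixtures.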

It is clear from these examples that topic modeling is an unsupervised learning method. It helps in discovering structures or patterns in documents when we have few or no labels for text classification. The most popular algorithm for topic modeling is Latent Dirichlet Allocation (LDA). The original paper on LDA uses a variational Bayesian method to estimate the probabilities of words belonging to different topics; the details of the algorithm are out of the scope of this book and can be found in the paper Latent Dirichlet Allocation (https://dl.acm.org/citation.cfm?id=944937).

We will now look at an example of topic modeling using LDA. We will use the gensim library to build an LDA model that finds topics in sample texts from the NLTK webtext corpus, which we introduced in Chapter 2, Text Classification and POS Tagging Using NLTK. The complete Jupyter Notebook for this example is available at Chapter05/01_example.ipynb in this book's code repository. First, we will import the necessary Python modules for our example:

from nltk.corpus import webtext, stopwords
import gensim
import random
from pprint import pprint
import numpy as np
import logging
# Uncomment the following line to log gensim's training progress
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Here, we import the webtext and stopwords corpora from nltk.corpus, along with the gensim, random, pprint, numpy, and logging modules. To use the gensim module, we can install it with the pip installer:

pip install gensim

This will install gensim on the system. The code that follows reads the sentences from the corresponding text corpus:

firefox = webtext.sents('firefox.txt')
wine = webtext.sents('wine.txt')
pirates = webtext.sents('pirates.txt')
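These calls assume that the webtext corpus is available locally; if it is not, NLTK raises a LookupError when the corpus is accessed. A one-time download fetches it, along with the stopwords corpus we use later (the corpus IDs below are NLTK's standard names):

import nltk

# One-time download of the NLTK data used in this example
nltk.download('webtext')
nltk.download('stopwords')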

The sents function reads the sentences from the respective text corpus. The sentences from the Firefox discussion forum, the script of the movie Pirates of the Caribbean, and the wine reviews are collected in a single list, as shown in the following code:

all_docs = []
all_docs.extend(firefox)
all_docs.extend(pirates)
all_docs.extend(wine)
random.shuffle(all_docs)

This collates all the text documents in a Python list, which we then shuffle so that documents from the three sources are randomly interleaved. We will later verify whether the topic model can distinguish between the three categories of text we have chosen from the NLTK webtext corpus. The following code preprocesses the documents:

# Remove stop words, then drop documents left with fewer than two tokens
stop_words = set(stopwords.words('english'))
docs = [[word for word in doc if word not in stop_words] for doc in all_docs]
docs = [doc for doc in docs if len(doc) > 1]

In the preceding code, we remove the stop words from the text using the NLTK stopwords corpus. We also discard all documents that, after stop word removal, contain one word or fewer. The following code builds the dictionary and trains the LDA model:

# Process the whole corpus in a single chunk per pass
chunksize = len(docs)
dictionary = gensim.corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary,
                               num_topics=3, passes=20, chunksize=chunksize)

We use the gensim Dictionary class to transform our text collection docs into a bag-of-words representation, which is stored in the corpus variable. The dictionary maps each word or token to an integer ID, and doc2bow transforms each document into a sparse bag-of-words vector of (token ID, count) pairs. The corpus is passed to LdaModel along with the dictionary, which provides the ID-to-word mapping for training. We have set the number of topics to 3, which matches the number of sources we selected from the NLTK webtext corpus, and the number of passes over the corpus to 20.
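To see what an individual bag-of-words vector looks like, we can run doc2bow on a small, made-up token list (the probe tokens below are only for illustration; a token that is not in the dictionary is silently dropped):

# Hypothetical probe: 'wine' and 'fruit' occur in the wine reviews,
# so they should already have IDs in the dictionary
print(dictionary.doc2bow(['wine', 'fruit', 'wine']))
# Prints a sparse vector of (token ID, count) pairs,
# for example [(1521, 1), (1677, 2)] -- the actual IDs will differ

The code that follows extracts the top topics from the corpus: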

# top_topics returns topics sorted by coherence; each entry is a
# ([(probability, word), ...], coherence_score) tuple
top_topics = model.top_topics(corpus)
print("Topic 1: ")
print(top_topics[0][0])
print("Topic 2: ")
print(top_topics[1][0])
print("Topic 3: ")
print(top_topics[2][0])

The following output shows the topics and the probability of words occurring in each topic:


Topic 1:
[(0.029538315, '.'), (0.025298702, '"'), (0.018974159, "'"), (0.017001661, '-'), (0.0097839413, '('), (0.0089947991, 'page'), (0.0080595175, ')'), (0.0076006982, 'window'), (0.0075753955, 'Firefox'), (0.0061700493, 'open'), (0.0058493023, 'menu'), (0.0057583884, 'bar'), (0.005752211, ':'), (0.0057242708, 'tab'), (0.0054682544, 'new'), (0.0053855875, 'Firebird'), (0.0052021407, 'work'), (0.0050605903, 'browser'), (0.00455163, '0'), (0.0045419205, 'button')]

Topic 2:
[(0.10882618, '.'), (0.048713163, ','), (0.033278842, '-'), (0.019521466, 'I'), (0.018609792, '***'), (0.011298033, 'fruit'), (0.010273052, 'good'), (0.0097078849, 'A'), (0.0089780623, 'wine'), (0.0089215562, "'"), (0.0087491088, 'bit'), (0.0080983331, 'quite'), (0.0072782212, 'Top'), (0.0061755609, '****'), (0.0060614017, '**'), (0.005842932, 'nose'), (0.0057750815, 'touch'), (0.0049686432, 'Bare'), (0.0048470194, 'Very'), (0.0047901836, 'palate')]

Topic 3:
[(0.051035155, ','), (0.043318823, ':'), (0.037644491, '.'), (0.029482145, '['), (0.029230012, ']'), (0.023068342, "'"), (0.019555457, '!'), (0.012494524, 'Jack'), (0.011483309, '?'), (0.010315109, '*'), (0.008776715, 'JACK'), (0.008776715, 'SPARROW'), (0.0074223313, '-'), (0.0061529884, 'WILL'), (0.0061529884, 'TURNER'), (0.0060977913, 'Will'), (0.0055771996, 'I'), (0.0054870662, '...'), (0.0041205585, 'ELIZABETH'), (0.0041205585, 'SWANN')]

We can see clear patterns in the model's output based on the word probabilities for each topic: Topic 1 appears to be about the Firefox discussion forum, Topic 2 follows the theme of the wine reviews, and Topic 3 corresponds to the Pirates of the Caribbean movie script. However, the output also includes punctuation tokens, since we only removed stop words; a simple improvement is sketched below.
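As a minimal sketch of that improvement (an extension to this example, not part of the original notebook), we can widen the filtering step to drop punctuation before building the dictionary, using Python's string.punctuation plus the multi-character rating markers seen in the output:

import string

# Treat single punctuation characters and the rating markers as noise
noise = set(string.punctuation) | {'...', '**', '***', '****'}
docs = [[word for word in doc if word not in stop_words and word not in noise]
        for doc in all_docs]

We can also ask the trained model for the topic mixture of any single document using get_document_topics, which returns (topic ID, probability) pairs:

# Topic distribution of the first document in the corpus,
# for example [(0, 0.02), (1, 0.95), (2, 0.03)] -- actual values will differ
print(model.get_document_topics(corpus[0]))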
