Chapter 11. Topic Modeling

Topic modeling is a relatively recent and exciting area that originated in the fields of natural language processing and information retrieval, but has since been applied in a number of other domains as well. Many problems in classification, such as sentiment analysis, involve assigning a single class to a particular observation. The key idea in topic modeling is that we can assign a mixture of different classes to an observation. As the field takes its inspiration from information retrieval, we often think of our observations as documents and our output classes as topics. In many applications this is actually the case, and so we will focus on the domain of text documents and their topics, as this is a very natural way to learn about this important class of models. In particular, we'll focus on a technique known as Latent Dirichlet Allocation (LDA), which is the most widely used method for topic modeling.

An overview of topic modeling

In Chapter 10, Probabilistic Graphical Models, we saw how we can use a bag of words as a feature of a Naïve Bayes model in order to perform sentiment analysis. There, the specific predictive task was to determine whether a particular movie review expressed a positive sentiment or a negative sentiment. We explicitly assumed that each movie review expressed only one possible sentiment. Each of the words used as features (such as bad, good, fun, and so on) had a different likelihood of appearing in a review under each sentiment.

To compute the model's decision, we computed the likelihood of all the words in a particular review under one class, and compared this to the likelihood of all the words having been generated by the other class. We adjusted these likelihoods using the prior probability of each class so that, when we know that one class is more popular in the training data, we expect to find it more frequently represented in unseen data in the future. There was no allowance for a movie review to be partially positive, with some of the words coming from the positive class, and partially negative, with the rest of the words coming from the negative class.
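As a quick refresher, the following short Python sketch reproduces this kind of computation with invented class priors and per-class word likelihoods (the numbers are purely illustrative): each class is scored by its log prior plus the sum of the log likelihoods of the observed words, and the class with the higher score wins.

import math

# Hypothetical class priors and per-class word likelihoods P(word | class);
# these numbers are invented for illustration only.
priors = {"positive": 0.6, "negative": 0.4}
likelihoods = {
    "positive": {"good": 0.05, "fun": 0.04, "bad": 0.005},
    "negative": {"good": 0.01, "fun": 0.005, "bad": 0.06},
}

review = ["good", "fun", "bad"]

# Score each class with log P(class) + sum of log P(word | class).
scores = {
    c: math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in review)
    for c in priors
}
print(max(scores, key=scores.get))  # the most likely class, here "positive"

Note that the entire review is attributed to a single class; no individual word is allowed to have come from the other class.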

The core premise behind topic models is that, in our problem, we have a set of features and a set of hidden or latent variables that generate these features. Crucially, each observation in our data contains features that have been generated from a mixture, or a subset, of these hidden variables. For example, an essay, website, or news article might have a central topic or theme such as politics, but might also include one or more elements from other themes as well, such as human rights, history, or economics.
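To make this generative story concrete, the following toy Python sketch (with made-up topic distributions over a tiny vocabulary) shows how a document's words could be produced from a mixture of hidden topics: the document first draws its own topic mixture, and each word then picks a topic from that mixture and is emitted from that topic's word distribution. As we will see, this is essentially the generative process that LDA assumes.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["election", "economy", "history", "rights"]

# Hypothetical per-topic word distributions (each row sums to 1).
topics = np.array([
    [0.7, 0.1, 0.1, 0.1],  # a "politics"-like topic
    [0.1, 0.1, 0.1, 0.7],  # a "human rights"-like topic
])

# Each document draws its own mixture of topics from a Dirichlet prior...
theta = rng.dirichlet(alpha=[0.5, 0.5])

# ...and each word first picks a topic, then a word from that topic.
doc = []
for _ in range(10):
    z = rng.choice(len(topics), p=theta)
    doc.append(str(rng.choice(vocab, p=topics[z])))
print(theta, doc)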

In the image domain, we might be interested in identifying a particular object in a scene from a set of visual features such as shadows and surfaces. These, in turn, might be the product of a mixture of different objects. Our task in topic modeling is to observe the words inside a document, or the pixels and visual features of an image, and from these determine the underlying mix of topics and objects respectively.

Topic modeling on text data can be used in a number of different ways. One possible application is to group together similar documents, either based on their predominant topic or based on their topical mix. Thus, it can be viewed as a form of clustering. By studying the topic composition, the most frequent words, and the relative sizes of the clusters we obtain, we can summarize information about a particular collection of documents.

We can use the most frequent words and topics of a cluster to describe it directly, and in turn this might be useful for automatically generating tags, for example, to improve the search capabilities of an information retrieval service for our documents. Yet another example might be to automatically recommend Twitter hashtags once we have built a topic model for a database of tweets.

When we describe documents such as websites using a bag of words approach, each document is essentially a vector indexed by the words in our dictionary. The elements of the vector are either counts of the various words or binary variables capturing whether a word was present in the document. Either way, this representation is a good method of encoding text into a numerical format, but the result is a sparse vector in a high-dimensional space, as the word dictionary is typically large. Under a topic model, each document is instead represented by a mixture of topics. As the number of topics tends to be much smaller than the dictionary size, topic modeling can also function as a form of dimensionality reduction.
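The following minimal sketch, using scikit-learn on a handful of toy documents, illustrates the contrast in dimensionality: the bag-of-words matrix has one column per dictionary word, whereas the fitted topic mixtures have only one column per topic.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the election dominated the political debate",
    "markets fell as the economy slowed",
    "the election result moved the markets",
]

# The bag-of-words representation: a sparse matrix with one column per word.
bow = CountVectorizer().fit_transform(docs)
print(bow.shape)  # (3, vocabulary size)

# The topic representation: one (dense) column per topic.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
mixtures = lda.fit_transform(bow)
print(mixtures.shape)  # (3, 2)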

Finally, topic modeling can also be viewed as a predictive task for classification. If we have a collection of documents, each labeled with a predominant theme, we can perform topic modeling on this collection. If the predominant topic clustering we obtain from this method coincides with our labeled categories, we can use the model to predict a topical mixture for an unknown document and classify it according to the most dominant topic. We'll see an example of this later on in this chapter; a minimal sketch is shown below.
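As a rough sketch of this idea (the documents and the topic-to-label mapping below are invented for illustration; in practice, the mapping would be chosen by inspecting the fitted topics against the labels), we could proceed as follows:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_docs = [
    "parliament passed the election reform bill",
    "the senate debated the new election law",
    "stock markets rallied as the economy grew",
    "inflation slowed and the economy recovered",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Suppose that inspecting the fitted topics suggests this mapping.
topic_labels = {0: "politics", 1: "economics"}

# Predict the topical mixture of an unseen document and take the
# most dominant topic as its class.
new_doc = ["the economy shrank despite market optimism"]
mixture = lda.transform(vectorizer.transform(new_doc))[0]
print(topic_labels[int(np.argmax(mixture))], mixture)

We will now introduce the most well-known technique for performing topic modeling, Latent Dirichlet Allocation.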
