Latent Dirichlet Allocation

The last model we want to present in this book is called Latent Dirichlet Allocation (LDA). It is a generative model that can be represented as a graphical model. It is based on the same idea as the mixture model, with one notable exception: in this model, we assume that the data points may be generated by a combination of clusters, and not just by one cluster at a time, as was the case before.

The LDA model is primarily used in text analysis and classification. A text document is composed of words forming sentences and paragraphs. To simplify the problem, we can say that each sentence or paragraph is about one specific topic, such as science, animals, sports, and so on. Topics can also be more specific, such as cats or European soccer. Therefore, there are words that are more likely to come from specific topics. For example, the word cat is likely to come from the topic cats, and the word stadium is likely to come from the topic European soccer. However, while the word ball comes with a higher probability from the topic European soccer, it is not unlikely to come from the topic cats, because cats like to play with balls too. So it seems the word ball might belong to two topics at the same time, with a different degree of certainty.

Other words such as table will certainly belong equally to both topics and presumably to others. They are very generic. Unless, of course, we introduce another topic such as furniture.

A document is a collection of words, so a document can have complex relationships to a set of topics. But, in the end, it is more likely we will see words coming from the same topic or the same topics within a paragraph, and to some extent in the document.

In general, we model a document with a bag-of-words model; that is, we consider a document to be a randomly generated set of words, drawn from a specific distribution over the words. If this distribution is uniform over all the words, then the document will be purely random, without a specific meaning. However, if this distribution has a specific form, with more probability mass on related words, then the collection of words generated by this model will have a meaning. Of course, generating documents is not really the application we have in mind for such a model. What we are interested in is the analysis of documents, their classification, and automatic understanding.
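To make the bag-of-words idea concrete, here is a small sketch in R; the vocabulary and the word probabilities are made up for illustration:

# A toy vocabulary and a non-uniform distribution over its words
vocab <- c("cat", "ball", "play", "goal", "stadium", "table")
probs <- c(0.35, 0.25, 0.20, 0.10, 0.05, 0.05)

# A "document" is just a random collection of words drawn from this distribution
set.seed(1)
doc <- sample(vocab, size = 20, replace = TRUE, prob = probs)
table(doc)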

The LDA model

Let's say z_i is a categorical variable (in other words, a histogram) representing the probability of appearance of each word from a dictionary.

Usually, in this kind of model, we restrict ourselves to the meaningful words and remove very common short words such as and, to, but, the, a, and so on. These words are usually called stop words.
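As a quick illustration, the tm text-mining package (which the packages used later in this section build on) ships standard stop word lists; this is only a sketch, since the document-term matrix construction used later can remove stop words for us:

library(tm)

sentence <- "the cat likes to play with a ball"
removeWords(sentence, stopwords("english"))
# the stop words "the", "to", "with", and "a" are removed, leaving the content words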

Let w_j be the j-th word in a document. A document-generating model could be represented with the following graphical model:

[Figure: graphical model of the simple document-generation process, in which the words w_j are drawn from the word distribution z_i]

Let θ be a distribution over topics. We can then extend this model by first choosing which topic is selected at each step, and then generating a word from that topic.

Therefore, the variable z_i now becomes the variable z_ij, which is the topic i selected for the word j. The graphical model is extended as follows:

[Figure: the graphical model extended with the topic distribution θ and the per-word topic variables z_ij]

We can go even further and decide we want to model a collection of documents, which seems natural if we consider we have a big dataset.

Assuming that documents are i.i.d., we can draw the graphical model again so that it captures M documents (on the right in the earlier figure).

And because θ is a categorical distribution over topics, we want to be Bayesian about it and give it a (Dirichlet) prior, mainly because it will help the model not to over-fit and because we consider the selection of topics for a document to be a random process in itself.

Moreover, we want to apply the same treatment to the word variable by putting a Dirichlet prior on it. This prior is used to avoid giving a zero probability to words that have not been observed: it smooths the distribution of words per topic. A uniform Dirichlet prior will induce a uniform prior distribution on all the words.
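A quick numerical sketch of this smoothing effect, with made-up word counts for a single topic:

# Raw word counts for one topic; "stadium" was never observed
counts <- c(cat = 10, ball = 5, play = 3, stadium = 0)

# Maximum-likelihood estimate: the unseen word gets a zero probability
counts / sum(counts)

# Adding a small Dirichlet pseudo-count smooths the estimate,
# so every word keeps a non-zero probability
pseudo <- 0.1
(counts + pseudo) / sum(counts + pseudo)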

The final graphical model is given by the following figure:

[Figure: the complete LDA graphical model, with Dirichlet priors on the topic and word distributions, replicated over M documents]

This is quite a complex graphical model, but techniques have been developed to fit its parameters and use the model in practice.

So, if we follow this graphical model carefully, we have a process that generates documents based on a certain set of topics:

  • From α, we draw θ, the distribution over topics for the document
  • From θ, we draw a topic z_ij for each word position
  • From this topic (and its distribution over words, β), we generate the word w_j (a sketch of this process in R follows the list)
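The following is a minimal sketch of this generative process in R; the number of topics, the vocabulary, and the values of α and β are made up for illustration (a real application estimates them from data):

set.seed(42)

vocab <- c("cat", "ball", "play", "goal", "stadium", "table")
K <- 2                 # number of topics (made up)
N <- 15                # number of words in the document
alpha <- rep(0.5, K)   # Dirichlet prior over topics

# beta[k, ] is the distribution over words for topic k (made up)
beta <- rbind(c(0.40, 0.25, 0.20, 0.00, 0.00, 0.15),   # a "cats"-like topic
              c(0.00, 0.25, 0.15, 0.30, 0.25, 0.05))   # a "soccer"-like topic

# Draw theta ~ Dirichlet(alpha) using normalized Gamma variables
theta <- rgamma(K, shape = alpha)
theta <- theta / sum(theta)

# For each word position, pick a topic from theta, then a word from that topic
z <- sample(1:K, size = N, replace = TRUE, prob = theta)
w <- sapply(z, function(k) sample(vocab, size = 1, prob = beta[k, ]))
print(w)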

In this model, only the words are observable. All the other variables will have to be determined without observation, exactly like in the other mixture models. So documents are represented as random mixtures over latent topics, in which each topic is represented as a distribution over words.

The distribution of a topic mixture based on this graphical model can be written as:

$$
p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{j=1}^{N} p(z_j \mid \theta)\, p(w_j \mid z_j, \beta)
$$

You see in this formula that for each word we select a topic, hence the product from 1 to N.

Integrating over θ, and summing over z, the marginal distribution of a document is as follows:

$$
p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{j=1}^{N} \sum_{z_j} p(z_j \mid \theta)\, p(w_j \mid z_j, \beta) \right) d\theta
$$

The final distribution can be obtained by taking the product of marginal distributions of single documents, so as to get the distribution over a collection of documents (assuming documents are independently and identically distributed). Here, D is the collection of documents:

$$
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{j=1}^{N_d} \sum_{z_{dj}} p(z_{dj} \mid \theta_d)\, p(w_{dj} \mid z_{dj}, \beta) \right) d\theta_d
$$

The main problem to solve now is how to compute the posterior distribution over θ and z given a document. By applying the Bayes formula we know that:

$$
p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}
$$

Unfortunately, this is intractable because of the normalization factor in the denominator. The original paper on LDA therefore refers to a technique called variational inference, which transforms a complex Bayesian inference problem into a simpler approximation that can be solved as a (convex) optimization problem. This technique is the third approach to Bayesian inference and has been used on many other problems. In the next section, we briefly review the principles of variational inference and, finally, we will show an example in R to conclude this section.

Variational inference

The main idea in variational inference is to consider a family of lower bounds indexed by variational parameters and optimize on those parameters to find the tightest lower bound. Practically speaking, the idea is to approximate a complicated distribution we wish to evaluate by a simpler distribution such that the distance (or any suitable metric between the distributions) can be minimized with a convex optimization procedure. The reason we want things to be convex is essentially because convex problems have a global minimum.

In general, a good approximation for a graphical model consists in simplifying the graph of the model by decoupling variables. In practice, we remove edges.

In the LDA model, the proposed variational approximation is obtained by decoupling the variables θ and z.

The resulting graphical model after decoupling no longer shows the connection between θ and z_i, but it includes new free variational parameters. The final distribution is given by:

$$
q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{j=1}^{N} q(z_j \mid \phi_j)
$$

Here, γ is a Dirichlet parameter and φ a set of multinomial parameters.

The optimization problem requires a way to calculate some distance or discrepancy between the simplified distribution and the real distribution.

This is usually done by using the Kullback-Leibler divergence between the two distributions. The optimization problem is now to find (γ, φ) such that:

$$
(\gamma^*, \phi^*) = \arg\min_{(\gamma, \phi)} \mathrm{D}_{\mathrm{KL}}\!\left( q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta) \right)
$$
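As a reminder of what the Kullback-Leibler divergence measures, here is a small sketch for discrete distributions; the distributions themselves are made up, whereas in LDA the divergence is taken between the variational distribution q and the true posterior:

# Kullback-Leibler divergence D_KL(q || p) between two discrete distributions
kl <- function(q, p) sum(q * log(q / p))

p  <- c(0.7, 0.2, 0.1)   # "true" distribution
q1 <- c(0.6, 0.3, 0.1)   # close to p
q2 <- c(0.1, 0.2, 0.7)   # far from p

kl(q1, p)   # small divergence
kl(q2, p)   # much larger divergence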

Many optimization algorithms are able to solve this problem.

The fitting of the parameters of the model can again be done using the EM algorithm. However, as for inference, the E-step is intractable, but it can be approximated with the same variational technique.

The E-step consists in finding the values for the variational parameters for each document. Then the M-step consists in maximizing the lower bound of the log-likelihood with respect to the parameters α and β. The steps are repeated until convergence of the lower bound on the log-likelihood.

Examples

We will use the RTextTools and topicmodels packages. The second one contains an implementation of the LDA model as described before.

First we load some data:

library(RTextTools)
library(topicmodels)

data(NYTimes)
data <- NYTimes[sample(1:3100, size = 1000, replace = FALSE), ]

The resulting data.frame contains a title (Title), a subject (Subject), and an associated topic code (Topic.Code) for each entry. This dataset contains headlines from the New York Times.

Then we create a matrix suitable for the LDA() function in the topicmodels package:

matrix <- create_matrix(cbind(as.vector(data$Title), as.vector(data$Subject)),
                        language = "english", removeNumbers = TRUE, stemWords = TRUE)

Next, we set the number of topics. This is computed by looking at the number of unique Topic.Code values in the original data set, which has been specially compiled for this task:

k <- length(unique(data$Topic.Code))

Finally, we run the learning algorithm, using the variational EM algorithm. This function also provides a Gibbs sampling method to solve the same problem:

lda <- LDA(matrix, k)

The result is a topic model with 27 topics, as expected. Let's see this in detail. The returned object is an S4 object (so you will notice we use the @ notation in R).
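As mentioned above, the same model can also be fitted by Gibbs sampling; the call might look like the following (the rest of this example keeps using the variational fit stored in lda):

lda_gibbs <- LDA(matrix, k, method = "Gibbs")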

Let's take the first document and look at its posterior distribution over the topics:

print(lda@gamma[1,])
 [1] 0.649978052 0.004191364 0.004191364 0.004191364 0.004191364 0.004191364 0.004191364 0.004191364 0.004191364 0.004191364 0.004191364 0.004191364 0.004191364 0.004191364 0.004191364 0.004191364
[17] 0.004191364 0.004191364 0.115045483 0.004191364 0.004191364 0.004191364 0.134383733 0.004191364 0.004191364 0.004191364 0.004191364

We see that the first topic has a higher probability. We can plot this to view it better:
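A call such as the following, in the same style as the average plot shown further down, would produce this kind of figure:

plot(lda@gamma[1, ], t = 'h')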

[Figure: posterior distribution over topics for the first document]

We will also look at the average distribution over all the documents:

plot(colSums(lda@gamma) / nrow(lda@gamma), t = 'h')

[Figure: average topic distribution over all the documents]

From there, we can see that the distribution over the topics is clearly not uniform; a uniform distribution would have been really surprising.

So we have a simple way to extract the most probable topic from each document. Note that, in the case of the first document, one topic was highly probable and two others appeared too. The rest were insignificant.

We can, for example, search for the number of documents that have two or more topics with a probability higher than 10%:

sum(sapply(1:nrow(lda@gamma), function(i) sum(lda@gamma[i, ] > 0.1) > 1))

We find 649 out of 1,000 documents. However, if we look at thresholds in intervals of 10%, from 0% to 100%, we see that this number drops quickly. So it seems that our data set has a lot of documents that associate themselves with only one topic at a time. The following graph shows this progression:

[Figure: number of documents sharing two or more topics, as the probability threshold increases]
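A sketch of how this progression can be computed, counting at each threshold the documents that have at least two topics above it:

thresholds <- seq(0.1, 0.9, by = 0.1)
shared <- sapply(thresholds, function(thr)
  sum(sapply(1:nrow(lda@gamma), function(i) sum(lda@gamma[i, ] > thr) > 1)))
plot(thresholds, shared, t = 'h')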

For example, at a 30% threshold, only a couple of hundred documents still spread their probability over at least two topics. Then the number drops. All sorts of analyses can be done on such a collection, such as finding the words belonging to each topic.
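For instance, the topicmodels package provides the terms() and topics() accessors; a quick look at the most probable words of each topic and at the dominant topic of the first few documents could look like this:

# The five most probable words for each topic
terms(lda, 5)

# The most likely topic for each of the first ten documents
topics(lda)[1:10]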
