We have seen the supervised learning (classification) of text documents in Chapter 6, Bayesian Classification Models, using the Naïve Bayes model. Often, a large text document, such as a news article or a short story, can contain different topics as subsections. It is useful to model such intra-document statistical correlations for the purpose of classification, summarization, compression, and so on. The Gaussian mixture model learned in the previous section is more applicable to numerical data, such as images, than to documents. This is because words in documents seldom follow a normal distribution; a more appropriate choice is a multinomial distribution.
A powerful extension of mixture models to documents is the work of T. Hofmann on Probabilistic Latent Semantic Indexing (reference 6 in the References section of this chapter) and that of David Blei, et al. on Latent Dirichlet Allocation (reference 7 in the References section of this chapter). In these works, a document is described as a mixture of topics, and each topic is described by a distribution over words. LDA is a generative unsupervised model for text documents. The task of LDA is to learn the parameters of the topic distribution, the word distributions, and the mixture coefficients from data. A brief overview of LDA is presented in the next section. Readers are strongly advised to read the paper by David Blei, et al. to comprehend their approach.
In LDA, it is assumed that words are the basic units of documents. A word is one element of a set known as the vocabulary, indexed by $\{1, \ldots, V\}$. Here, $V$ denotes the size of the vocabulary. A word can be represented by a unit-basis vector, all of whose components are zero except the one corresponding to the word, which has a value of 1. For example, the $n$th word in the vocabulary is described by a vector $w$ of size $V$, whose $n$th component $w^n = 1$ and all other components $w^m = 0$ for $m \neq n$. Similarly, a document is a collection of $N$ words denoted by $\mathbf{w} = (w_1, w_2, \ldots, w_N)$, and a corpus is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$ (note that documents are represented here by a boldface $\mathbf{w}$, whereas words are denoted by a plain $w$).
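To make this representation concrete, the following short Python sketch encodes words of a hypothetical toy vocabulary as unit-basis vectors of length $V$ and a document as a sequence of such vectors (the vocabulary and document used here are illustrative, not from the original text):

```python
import numpy as np

# Hypothetical toy vocabulary of size V = 5
vocabulary = ["bayes", "topic", "word", "model", "corpus"]
V = len(vocabulary)

def one_hot(word):
    """Return the unit-basis vector of length V for a word in the vocabulary."""
    vec = np.zeros(V)
    vec[vocabulary.index(word)] = 1.0
    return vec

# A document is a sequence of N words, i.e. an N x V matrix of unit-basis vectors
document = ["topic", "model", "word", "topic"]
w = np.array([one_hot(word) for word in document])
print(w)  # 4 x 5 matrix; each row has a single 1 in the word's position
```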
As mentioned earlier, LDA is a generative probabilistic model of a corpus in which documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words. To generate each document $\mathbf{w}$ in a corpus, the LDA model performs the following steps (a short simulation of this process is sketched after the list):

1. Choose the number of words $N \sim \text{Poisson}(\xi)$.
2. Choose a topic mixture $\theta \sim \text{Dir}(\alpha)$.
3. For each of the $N$ words $w_n$:
   1. Choose a topic $z_n \sim \text{Multinomial}(\theta)$.
   2. Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.
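As an illustration, the following Python sketch simulates this generative process with NumPy. The values of the number of topics, vocabulary size, $\alpha$, $\beta$, and $\xi$ used here are hypothetical, chosen only to make the example runnable:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 5                                # assumed number of topics and vocabulary size
alpha = np.full(K, 0.5)                    # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.ones(V), size=K)   # per-topic word distributions (K x V)
xi = 8                                     # mean document length for the Poisson step

def generate_document():
    N = rng.poisson(xi)                    # 1. choose document length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)           # 2. choose topic mixture theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z_n = rng.choice(K, p=theta)       # 3a. choose topic z_n ~ Multinomial(theta)
        w_n = rng.choice(V, p=beta[z_n])   # 3b. choose word w_n from p(w_n | z_n, beta)
        words.append(w_n)
    return words

corpus = [generate_document() for _ in range(4)]  # a small corpus of M = 4 documents
print(corpus)
```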
Given the values of $N$, $\alpha$, and $\beta$, the joint distribution of a topic mixture $\theta$, a set of topics $\mathbf{z}$, and a set of words $\mathbf{w}$ is given by:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
Note that, in this case, only $\mathbf{w}$ (the documents) is observed, and both $\theta$ and $\mathbf{z}$ are treated as latent (hidden) variables.
The Bayesian inference problem in LDA is the estimation of the posterior density of the latent variables $\theta$ and $\mathbf{z}$, given a document $\mathbf{w}$:

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$$
As with many Bayesian models, this posterior is analytically intractable, and one has to use approximate techniques, such as MCMC or variational Bayes, to estimate it.
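As a practical illustration, scikit-learn's LatentDirichletAllocation class approximates this posterior using online variational Bayes. The snippet below is a minimal sketch on a hypothetical toy corpus; the documents, number of topics, and parameter values are illustrative assumptions, not part of the original text:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus; in practice these would be real documents
docs = [
    "bayesian models describe uncertainty with probability",
    "topic models describe documents as mixtures of topics",
    "football players scored goals in the match",
    "the match ended with a late goal",
]

# Build the document-term (word count) matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit LDA with 2 topics; scikit-learn estimates the posterior with variational Bayes
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)   # per-document topic mixtures (estimates of theta)

# lda.components_ holds the per-topic word weights (related to the beta parameters)
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-3:][::-1]]
    print(f"Topic {k}: {top_words}")
```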