Topic modeling using Bayesian inference

We have seen the supervised learning (classification) of text documents in Chapter 6, Bayesian Classification Models, using the Naïve Bayes model. Often, a large text document, such as a news article or a short story, contains different topics as subsections. It is useful to model such intra-document statistical correlations for the purposes of classification, summarization, compression, and so on. The Gaussian mixture model described in the previous section is more applicable to numerical data, such as images, than to documents, because words in documents seldom follow a normal distribution. A more appropriate choice is the multinomial distribution.

A powerful extension of mixture models to documents is the work of T. Hofmann on Probabilistic Latent Semantic Indexing (reference 6 in the References section of this chapter) and that of David Blei et al. on Latent Dirichlet allocation (reference 7 in the References section of this chapter). In these works, a document is described as a mixture of topics and each topic is described by a distribution over words. LDA is a generative unsupervised model for text documents. The task of LDA is to learn the parameters of the topic distribution, the word distributions, and the mixture coefficients from data. A brief overview of LDA is presented in the next section. Readers are strongly advised to read the paper by David Blei et al. to fully comprehend their approach.

Latent Dirichlet allocation

In LDA, it is assumed that words are the basic units of documents. A word is an element of a set known as the vocabulary, indexed by $\{1, \ldots, V\}$, where V denotes the size of the vocabulary. A word can be represented by a unit-basis vector, all of whose components are zero except the one corresponding to the word, which has a value of 1. For example, the nth word in the vocabulary is described by a vector w of size V whose nth component $w^n = 1$ and all other components $w^j = 0$ for $j \neq n$. Similarly, a document is a collection of N words denoted by $\mathbf{w} = (w_1, w_2, \ldots, w_N)$, and a corpus is a collection of M documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$ (note that documents are represented here by a boldface w, whereas words are represented by a plain w).
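As a small illustration of this representation, the following Python sketch builds such unit-basis (one-hot) vectors; the toy vocabulary and the helper name one_hot are assumptions made only for this example, not from the text:

    # Illustrative sketch: unit-basis (one-hot) vectors over a toy vocabulary of size V
    vocabulary = ["bayes", "topic", "word", "model", "corpus"]   # assumed toy vocabulary
    V = len(vocabulary)
    word_index = {w: i for i, w in enumerate(vocabulary)}

    def one_hot(word):
        # Return a length-V vector with 1 at the word's index and 0 elsewhere
        vec = [0] * V
        vec[word_index[word]] = 1
        return vec

    print(one_hot("topic"))   # [0, 1, 0, 0, 0]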

As mentioned earlier, LDA is a generative probabilistic model of a corpus in which documents are represented as random mixtures over latent topics and each topic is characterized by a distribution over words. To generate each document w in a corpus under the LDA model, the following steps are performed (a code sketch of this generative process is given after the list):

  1. Choose the value of N corresponding to the size of the document, according to a Poisson distribution characterized by parameter $\xi$:
    $N \sim \text{Poisson}(\xi)$
  2. Choose the value of the parameter $\theta$ that characterizes the topic distribution from a Dirichlet distribution characterized by parameter $\alpha$:
    $\theta \sim \text{Dir}(\alpha)$
  3. For each of the N words $w_n$:
    1. Choose a topic $z_n$ according to the multinomial distribution characterized by the parameter $\theta$ drawn in step 2:
      $z_n \sim \text{Multinomial}(\theta)$
    2. Choose a word $w_n$ from the multinomial probability distribution characterized by $\beta$ and conditioned on $z_n$:
      $w_n \sim p(w_n \mid z_n, \beta)$
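This generative process can be simulated directly. The following is a minimal sketch in Python using NumPy, not code from the text: the number of topics K, the parameters xi, alpha, and beta, and the toy vocabulary are all assumed values chosen only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed toy settings (not from the text)
    vocabulary = ["bayes", "topic", "word", "model", "corpus"]
    V = len(vocabulary)
    K = 3                                        # number of latent topics
    xi = 8                                       # Poisson parameter for the document length
    alpha = np.full(K, 0.5)                      # Dirichlet parameter for the topic mixture
    beta = rng.dirichlet(np.ones(V), size=K)     # per-topic word distributions (K x V)

    # Step 1: choose the document length N ~ Poisson(xi)
    N = rng.poisson(xi)

    # Step 2: choose the topic mixture theta ~ Dir(alpha)
    theta = rng.dirichlet(alpha)

    document = []
    for _ in range(N):
        # Step 3.1: choose a topic z_n from Multinomial(theta)
        z_n = rng.choice(K, p=theta)
        # Step 3.2: choose a word w_n from the chosen topic's word distribution beta[z_n]
        w_n = rng.choice(V, p=beta[z_n])
        document.append(vocabulary[w_n])

    print(document)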

Given the values of N, $\alpha$, and $\beta$, the joint distribution of a topic mixture $\theta$, a set of topics z, and a set of words w, is given by:

$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$

Note that, in this case, only w is observed (the documents) and both $\theta$ and z are treated as latent (hidden) variables.

The Bayesian inference problem in LDA is the estimation of the posterior density of the latent variables $\theta$ and z, given a document w:

$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \dfrac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$

As with many Bayesian models, this posterior is analytically intractable, because the normalizing constant $p(\mathbf{w} \mid \alpha, \beta)$ in the denominator involves an integral over $\theta$ coupled with a sum over all possible topic assignments. One therefore has to use approximate techniques, such as MCMC or variational Bayes, to estimate the posterior.
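As one concrete illustration of the variational Bayes route, the following Python sketch uses scikit-learn's LatentDirichletAllocation, which fits LDA by variational inference. The toy corpus, the choice of two topics, and the prior values are assumptions made only for this example, not values from the referenced papers.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Assumed toy corpus for illustration
    corpus = [
        "the stock market fell amid economic worries",
        "the team won the championship game last night",
        "new economic policy affects the stock exchange",
        "the player scored twice in the final game",
    ]

    # Bag-of-words counts (multinomial word counts per document)
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)

    # Variational Bayes fit; the number of topics and priors are assumed values
    lda = LatentDirichletAllocation(n_components=2, doc_topic_prior=0.5,
                                    topic_word_prior=0.1, random_state=0)
    doc_topic = lda.fit_transform(X)     # per-document topic mixtures (theta)

    # Top words per topic, from the learned topic-word weights (beta)
    terms = vectorizer.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = topic.argsort()[::-1][:4]
        print(f"topic {k}:", [terms[i] for i in top])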
