We have seen the supervised learning (classification) of text documents in Chapter 6, Bayesian Classification Models, using the Naïve Bayes model. Often, a large text document, such as a news article or a short story, can contain different topics as subsections. It is useful to model such intra-document statistical correlations for the purpose of classification, summarization, compression, and so on. The Gaussian mixture model learned in the previous section is more applicable to numerical data, such as images, than to documents. This is because words in documents seldom follow a normal distribution; a more appropriate choice is a multinomial distribution.
A powerful extension of mixture models to documents is the work of T. Hofmann on Probabilistic Latent Semantic Indexing (reference 6 in the References section of this chapter) and that of David Blei, et al. on Latent Dirichlet Allocation (reference 7 in the References section of this chapter). In these works, a document is described as a mixture of topics, and each topic is described by a distribution over words. LDA is a generative unsupervised model for text documents. The task of LDA is to learn the parameters of the topic distribution, the word distributions, and the mixture coefficients from data. A brief overview of LDA is presented in the next section. Readers are strongly advised to read the paper by David Blei, et al. to comprehend their approach.
In LDA, it is assumed that words are the basic units of documents. A word is one element of a set known as the vocabulary, indexed by $\{1, \ldots, V\}$. Here, $V$ denotes the size of the vocabulary. A word can be represented by a unit-basis vector, all of whose components are zero except the one corresponding to the word, which has a value of 1. For example, the $n$th word in the vocabulary is described by a vector $w$ of size $V$, whose $n$th component $w^n = 1$ and all other components $w^m = 0$ for $m \neq n$. Similarly, a document is a collection of $N$ words denoted by $\mathbf{w} = (w_1, w_2, \ldots, w_N)$, and a corpus is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$ (note that documents are represented here by a boldface $\mathbf{w}$, whereas words are denoted by a plain $w$).
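To make this representation concrete, the following short Python sketch encodes words of a hypothetical toy vocabulary as unit-basis vectors of length $V$ and a document as a sequence of such vectors (the vocabulary and document used here are illustrative, not from the original text):

```python
import numpy as np

# Hypothetical toy vocabulary of size V = 5
vocabulary = ["bayes", "topic", "word", "model", "corpus"]
V = len(vocabulary)

def one_hot(word):
    """Return the unit-basis vector of length V for a word in the vocabulary."""
    vec = np.zeros(V)
    vec[vocabulary.index(word)] = 1.0
    return vec

# A document is a sequence of N words, i.e. an N x V matrix of unit-basis vectors
document = ["topic", "model", "word", "topic"]
w = np.array([one_hot(word) for word in document])
print(w)  # 4 x 5 matrix; each row has a single 1 in the word's position
```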
As mentioned earlier, LDA is a generative probabilistic model of a corpus in which documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words. To generate each document $\mathbf{w}$ in a corpus, the LDA model performs the following steps (a short simulation of this process is sketched after the list):

1. Choose the number of words $N \sim \text{Poisson}(\xi)$.
2. Choose a topic mixture $\theta \sim \text{Dir}(\alpha)$.
3. For each of the $N$ words $w_n$:
   1. Choose a topic $z_n \sim \text{Multinomial}(\theta)$.
   2. Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.
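As an illustration, the following Python sketch simulates this generative process with NumPy. The values of the number of topics, vocabulary size, $\alpha$, $\beta$, and $\xi$ used here are hypothetical, chosen only to make the example runnable:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 5                                # assumed number of topics and vocabulary size
alpha = np.full(K, 0.5)                    # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.ones(V), size=K)   # per-topic word distributions (K x V)
xi = 8                                     # mean document length for the Poisson step

def generate_document():
    N = rng.poisson(xi)                    # 1. choose document length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)           # 2. choose topic mixture theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z_n = rng.choice(K, p=theta)       # 3a. choose topic z_n ~ Multinomial(theta)
        w_n = rng.choice(V, p=beta[z_n])   # 3b. choose word w_n from p(w_n | z_n, beta)
        words.append(w_n)
    return words

corpus = [generate_document() for _ in range(4)]  # a small corpus of M = 4 documents
print(corpus)
```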
Given the values of $N$, $\alpha$, and $\beta$, the joint distribution of a topic mixture $\theta$, a set of topics $\mathbf{z}$, and a set of words $\mathbf{w}$ is given by:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
Note that, in this case, only $\mathbf{w}$ (the documents) is observed, and both $\theta$ and $\mathbf{z}$ are treated as latent (hidden) variables.
The Bayesian inference problem in LDA is the estimation of the posterior density of the latent variables $\theta$ and $\mathbf{z}$, given a document $\mathbf{w}$:

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$$
As with many Bayesian models, this posterior is analytically intractable, and one has to use approximate techniques, such as MCMC or variational Bayes, to estimate it.
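As a practical illustration, scikit-learn's LatentDirichletAllocation class approximates this posterior using online variational Bayes. The snippet below is a minimal sketch on a hypothetical toy corpus; the documents, number of topics, and parameter values are illustrative assumptions, not part of the original text:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus; in practice these would be real documents
docs = [
    "bayesian models describe uncertainty with probability",
    "topic models describe documents as mixtures of topics",
    "football players scored goals in the match",
    "the match ended with a late goal",
]

# Build the document-term (word count) matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit LDA with 2 topics; scikit-learn estimates the posterior with variational Bayes
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)   # per-document topic mixtures (estimates of theta)

# lda.components_ holds the per-topic word weights (related to the beta parameters)
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-3:][::-1]]
    print(f"Topic {k}: {top_words}")
```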