Probabilistic latent semantic analysis

Probabilistic Latent Semantic Analysis (pLSA) takes a statistical perspective on LSA and uses a generative model to address LSA's lack of theoretical underpinnings.

pLSA explicitly models the probability of each co-occurrence of documents d and words w contained in the DTM as a mixture of conditionally independent multinomial distributions that involve topics t.

The symmetric formulation of this generative process assumes that both words and documents are generated by the latent topic class, whereas the asymmetric model first selects a topic given the document and then generates the word in a second step given the topic:
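$$P(w, d) = \sum_{t} P(t)\, P(d \mid t)\, P(w \mid t) \qquad \text{(symmetric)}$$

$$P(w, d) = P(d) \sum_{t} P(t \mid d)\, P(w \mid t) \qquad \text{(asymmetric)}$$

These are the standard pLSA factorizations; both express the joint probability of a document-word pair as a mixture over the latent topics t.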

The number of topics is a hyperparameter chosen before training and is not learned from the data.

Probabilistic models often use plate notation to express dependencies. A plate diagram encodes the relationships just described for the asymmetric model: each rectangle represents repeated items, such as M documents for the outer plate and N words per document for the inner plate. We only observe the documents and their content; the model infers the hidden, or latent, topic distribution.

The benefit of using a probability model is that we can now compare models by evaluating the probability they assign to new documents given the parameters learned during training.
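Scikit-learn does not ship a dedicated pLSA class, but pLSA is known to be equivalent to non-negative matrix factorization (NMF) with a generalized Kullback-Leibler objective, so NMF can stand in for it. The following is a minimal sketch under that assumption; the toy corpus and all variable names are illustrative, not from the original text:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

# Toy corpus standing in for a real document collection (illustrative only).
docs = [
    'stock market prices fall on heavy trading',
    'trading volume drives market prices',
    'neural networks learn word representations',
    'deep neural networks for representation learning',
]

# Build the document-term matrix (DTM) of raw word counts.
dtm = CountVectorizer().fit_transform(docs)

# The number of topics is a hyperparameter fixed before training.
n_topics = 2
plsa = NMF(n_components=n_topics,
           solver='mu',                   # multiplicative updates support KL loss
           beta_loss='kullback-leibler',  # makes NMF equivalent to pLSA
           max_iter=500,
           random_state=42)

doc_topic = plsa.fit_transform(dtm)  # proportional to P(topic | document)
topic_word = plsa.components_        # proportional to P(word | topic)

# Normalize rows so each factor becomes a proper conditional distribution.
p_topic_given_doc = doc_topic / doc_topic.sum(axis=1, keepdims=True)
p_word_given_topic = topic_word / topic_word.sum(axis=1, keepdims=True)
```

The normalized factors correspond to the conditional distributions of the asymmetric model above, so the fitted parameters can be inspected or used to score documents.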
