How does the LDA algorithm work?

LDA is a topic model that infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm where topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset. Topics and documents both exist in a feature space, where feature vectors are vectors of word counts (bags of words). Instead of estimating a clustering using a traditional distance, LDA uses a function based on a statistical model of how text documents are generated (see Figure 3):

Figure 3: Working principle of the LDA algorithm on a collection of documents
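To make the bag-of-words representation concrete, here is a minimal Scala sketch (the tiny corpus and vocabulary are hypothetical, invented purely for illustration) that turns a collection of tokenized documents into the word-count vectors LDA consumes:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // A tiny hypothetical corpus: each document is a sequence of tokens.
    val docs: Seq[Seq[String]] = Seq(
      Seq("spark", "mllib", "lda", "spark"),
      Seq("topic", "model", "lda"))

    // Build the vocabulary: every distinct term gets an index.
    val vocab: Map[String, Int] = docs.flatten.distinct.zipWithIndex.toMap

    // Bag of words: one sparse count vector per document, indexed by term.
    val bow: Seq[Vector] = docs.map { doc =>
      val counts = doc.groupBy(identity)
        .map { case (term, occs) => (vocab(term), occs.size.toDouble) }
        .toSeq
      Vectors.sparse(vocab.size, counts)
    }

Each document thus becomes a point in a vocabulary-sized feature space, and topics are inferred as distributions in that same space.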

In particular, we would like to discover which topics people talk about most in a large collection of text. Since the release of Spark 1.3, MLlib has supported LDA, which is one of the most successfully used TM techniques in the area of text mining and NLP.

Moreover, LDA is the first MLlib algorithm to adopt Spark GraphX. The following terminology is worth knowing before we formally start our TM application:

  • "word" = "term": an element of the vocabulary
  • "token": instance of a term appearing in a document
  • "topic": multinomial distribution over words representing some concept

The RDD-based LDA algorithm developed in Spark is a topic model designed for text documents. It is based on the original LDA paper (journal version): Blei, Ng, and Jordan, Latent Dirichlet Allocation, JMLR, 2003.

This implementation supports different inference algorithms via the setOptimizer function. The EMLDAOptimizer learns clustering using expectation-maximization (EM) on the likelihood function and yields comprehensive results, while OnlineLDAOptimizer uses iterative mini-batch sampling for online variational inference and is generally memory-friendly.
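As a minimal sketch, the two optimizers can be selected like this (the K value and mini-batch fraction below are illustrative only):

    import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

    // Expectation-maximization, selected by name (it is also the default).
    val emLDA = new LDA().setK(10).setOptimizer("em")

    // Online variational inference, configured explicitly.
    val onlineLDA = new LDA()
      .setK(10)
      .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))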

EM is an iterative method for approximating the maximum likelihood estimate. In practice, when the input data is incomplete, has missing data points, or involves hidden latent variables, ML estimation can still find a best-fit model.
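Concretely, writing the observed data as X, the hidden variables as Z, and the model parameters as θ, iteration t of EM alternates two steps (this is the standard textbook formulation, not anything Spark-specific):

    % E-step: expected complete-data log-likelihood under current parameters
    Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}\left[\log p(X, Z \mid \theta)\right]

    % M-step: re-estimate the parameters by maximizing that expectation
    \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})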

LDA takes in a collection of documents as vectors of word counts and the following parameters (set using the builder pattern):

  • K: Number of topics (that is, cluster centers) (default is 10).
  • ldaOptimizer: Optimizer to use for learning the LDA model, either EMLDAOptimizer or OnlineLDAOptimizer (default is EMLDAOptimizer).
  • seed: Random seed for reproducibility (optional).
  • docConcentration: Dirichlet parameter for the prior over documents' distributions over topics. Larger values encourage smoother inferred distributions (default is Vectors.dense(-1)).
  • topicConcentration: Dirichlet parameter for the prior over topics' distributions over terms (words). Larger values encourage smoother inferred distributions (default is -1).
  • maxIterations: Limit on the number of iterations (default is 20).
  • checkpointInterval: If checkpointing is being used (set in the Spark configuration), this parameter specifies the frequency with which checkpoints are created. If maxIterations is large, checkpointing can help reduce shuffle file sizes on disk and aid failure recovery (default is 10).
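A minimal sketch of setting these parameters with the builder pattern follows; the tiny corpus and the seed value are made up for illustration, and sc is assumed to be an existing SparkContext (as in spark-shell):

    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // A hypothetical corpus of (document ID, word-count vector) pairs.
    val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
      (0L, Vectors.dense(2.0, 1.0, 0.0)),
      (1L, Vectors.dense(0.0, 1.0, 3.0))))

    val lda = new LDA()
      .setK(3)                   // number of topics (default is 10)
      .setOptimizer("em")        // or "online" for OnlineLDAOptimizer
      .setSeed(12345L)           // random seed for reproducibility
      .setDocConcentration(-1)   // -1 lets MLlib choose a default prior
      .setTopicConcentration(-1) // likewise for the topic-term prior
      .setMaxIterations(20)
      .setCheckpointInterval(10)

    // With the EM optimizer, run() returns a DistributedLDAModel.
    val model = lda.run(corpus).asInstanceOf[DistributedLDAModel]
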
Figure 4: The topic distribution and how it looks 

Let's look at an example. Assume there are n balls in a basket in w different colors. Also assume that each term in the vocabulary has one of the w colors, and that the vocabulary terms are distributed over m topics. The frequency of occurrence of each color in the basket is then proportional to the corresponding term's weight in its topic, φ.

The LDA algorithm then incorporates a term-weighting scheme by making the size of each ball proportional to the weight of its corresponding term. In Figure 4, the n terms together account for the total weight of each topic, for example, topics 0 to 3. Figure 4 shows the topic distribution inferred from randomly generated Tweet text.

We have now seen that by using TM, we can find structure within an unstructured collection of documents. Once the structure is discovered, as shown in Figure 4, we can answer several questions such as the following (a brief sketch follows the list):

  • What is document X about?
  • How similar are documents X and Y?
  • If I am interested in topic Z, which documents should I read first?
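As a quick preview of the full example in the next section, the following sketch (continuing from the hypothetical model trained above) hints at how a trained DistributedLDAModel answers these questions:

    // Per-document topic mixtures: answers "what is document X about?".
    val docTopics: RDD[(Long, Vector)] = model.topicDistributions
    val aboutDoc0 = docTopics.lookup(0L).head.argmax // dominant topic of doc 0

    // Comparing two documents' topic mixtures (for example, by cosine
    // similarity) measures how similar documents X and Y are.

    // Top-weighted terms per topic help identify an interesting topic Z...
    val topTerms = model.describeTopics(maxTermsPerTopic = 5)

    // ...and the highest-probability documents per topic are the ones
    // to read first.
    val readFirst = model.topDocumentsPerTopic(3)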

In the next section, we will see an example of TM using a Spark MLlib-based LDA algorithm to answer the preceding questions.
