Other topic models versus the scalability of LDA

Throughout this end-to-end project, we have used LDA, which is one of the most popular TM algorithms used for text mining. We could use more robust TM algorithms, such as Probabilistic Latent Semantic Analysis (pLSA), the Pachinko Allocation Model (PAM), and the Hierarchical Dirichlet Process (HDP) algorithm.

However, pLSA is prone to overfitting. On the other hand, both HDP and PAM are more complex TM algorithms, used for demanding text-mining tasks such as extracting topics from high-dimensional text data or documents of unstructured text. Finally, non-negative matrix factorization (NMF) is another way to find topics in a collection of documents. Irrespective of the approach, the output of all the TM algorithms is a list of topics with associated clusters of words.
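To make the NMF idea concrete, here is a minimal, self-contained sketch of topic discovery via NMF using the classic multiplicative update rules on a toy document-term matrix. This is an illustrative implementation in plain NumPy, not the code used in this project; the matrix, function name, and iteration count are all assumptions for the example.

```python
import numpy as np

def nmf(V, n_topics, n_iter=200, seed=0):
    """Factorize a non-negative document-term matrix V (docs x terms)
    into W (docs x topics) and H (topics x terms), so that V ~ W @ H,
    using multiplicative updates. Rows of H are the discovered topics."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_topics)) + 1e-4
    H = rng.random((n_topics, n_terms)) + 1e-4
    eps = 1e-10  # avoid division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy corpus: documents 0-1 use terms 0-2, documents 2-3 use terms 3-5,
# so we expect two clearly separated topics.
V = np.array([[3, 2, 2, 0, 0, 0],
              [2, 3, 1, 0, 0, 0],
              [0, 0, 0, 2, 3, 2],
              [0, 0, 0, 3, 2, 3]], dtype=float)
W, H = nmf(V, n_topics=2)
# Each row of H is a topic; its largest entries are that topic's top terms.
```

As with LDA, the result is a set of topics (rows of H), each described by its highest-weighted words, and a per-document topic mixture (rows of W).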

The previous example shows how to perform TM using the LDA algorithm as a standalone application. The parallelization of LDA is not straightforward, and there have been many research papers proposing different strategies. The key obstacle in this regard is that all methods involve a large amount of communication.

According to a blog post on the Databricks website (https://databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html), the following statistics describe the dataset and cluster configuration used during their experiments:

  • Training set size: 4.6 million documents
  • Vocabulary size: 1.1 million terms
  • Number of tokens: 1.1 billion (~239 words/document)
  • 100 topics
  • 16-worker EC2 cluster, for example, m4.large or m3.medium instances depending upon budget and requirements

For the preceding setting, the timing result was 176 seconds/iteration on average over 10 iterations. From these statistics, it is clear that LDA scales quite well to very large corpora.
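The quoted figures can be sanity-checked with simple arithmetic; the short sketch below reproduces the ~239 words/document figure and estimates the total training time implied by the reported per-iteration timing (the variable names are ours, not from the blog post):

```python
# Back-of-the-envelope check of the statistics quoted above.
docs = 4.6e6          # training documents
tokens = 1.1e9        # total training tokens
sec_per_iter = 176    # reported average seconds per iteration
iters = 10            # iterations timed in the experiment

words_per_doc = tokens / docs                # ~239, matching the quoted figure
total_minutes = sec_per_iter * iters / 60    # ~29 minutes for 10 iterations
```

So a full 10-iteration run over 1.1 billion tokens completes in roughly half an hour on a 16-worker cluster, which is the basis for the scalability claim.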
