Introduction to topic models

According to Wikipedia, a topic model is defined as follows:

"In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body."

Topic models are essentially iterative algorithms that work with document feature matrices, using overlapping features to group documents together. Features could simply be all the words in a document, or selected features such as nouns or named entities, and so on. To explain in a simplified manner, imagine that we have a corpus of documents on mixed subjects and that we use words as features to represent each document. If we analysed these documents using a topic model, it would group words like "team", "match", "game", and "score" into a single topic (as these words frequently appear together), which we would name the SPORT topic, and words like "attorney", "case", "law", and "crime" into another topic, which we would name the LEGAL topic. This example shows that the documents we analysed essentially contained two topics, SPORT and LEGAL, as you can see in the following illustration:

The preceding example is a very simple one, meant only to convey how topics can be discovered in a set of previously unseen documents. In practice, the process can be far more complex.
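To make the idea of a document feature matrix more concrete, here is a minimal sketch in plain Python. The toy corpus, the stop-word list, and the helper names `tokenize` and `build_dfm` are all invented for illustration; a real pipeline would typically rely on a text-processing library and a much fuller stop-word list.

```python
from collections import Counter

# Toy corpus (invented for illustration): two sports and two legal documents.
docs = [
    "the team won the match with a late score",
    "the game ended and the team kept the score",
    "the attorney argued the case in court",
    "the law treats this crime as a serious case",
]

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "and", "in", "as", "with", "this"}


def tokenize(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_dfm(documents):
    """Return (vocabulary, matrix); matrix[d][j] counts vocabulary[j] in document d."""
    counts = [Counter(tokenize(d)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, matrix


vocabulary, dfm = build_dfm(docs)

# Print one row per document: only the features that actually occur in it.
for d, row in enumerate(dfm):
    present = {w: n for w, n in zip(vocabulary, row) if n > 0}
    print(d, present)
```

Each row of the matrix represents one document and each column one feature (here, a word). Documents that share non-zero columns, such as "team" and "score", are the ones a topic model would tend to group together.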

Two of the best-known topic models are Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). A detailed mathematical explanation and comparison of the two is beyond the scope of this book. We will use LDA in this chapter, as it is generally considered better suited to clustering high-dimensional data and more accurate at identifying topics than LSA, although LSA is faster to implement.
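For readers curious about what "iterative" means here, the following is a rough sketch of collapsed Gibbs sampling, one common way of fitting LDA. It is not the implementation used in this chapter, and it skips the mathematics entirely. The function name `lda_gibbs`, the toy documents, and the hyperparameter values (`alpha`, `beta`, the number of iterations) are assumptions chosen for illustration only.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized documents with collapsed Gibbs sampling.

    Returns (doc_topic, topic_word, vocabulary), where doc_topic[d][k] is the
    number of tokens in document d assigned to topic k, and topic_word[k][j]
    is the number of times vocabulary[j] is assigned to topic k.
    """
    rng = random.Random(seed)
    vocabulary = sorted({w for doc in docs for w in doc})
    word_id = {w: j for j, w in enumerate(vocabulary)}
    vocab_size = len(vocabulary)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][j] -= 1
                topic_total[k] -= 1
                # Weight of each topic: how common it is in this document
                # times how strongly it is associated with this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][j] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][j] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, vocabulary


# Toy tokenized corpus (invented for illustration).
toy_docs = [
    ["team", "match", "game", "score", "team"],
    ["game", "score", "match", "team"],
    ["attorney", "case", "law", "crime"],
    ["law", "case", "crime", "attorney", "case"],
]

doc_topic, topic_word, vocabulary = lda_gibbs(toy_docs, num_topics=2)

# Show the three most frequent words in each topic.
for k, counts in enumerate(topic_word):
    top = sorted(zip(counts, vocabulary), reverse=True)[:3]
    print("topic", k, [w for _, w in top])
```

The topics come out as unlabeled groups of words; names such as SPORT or LEGAL are something we assign ourselves after inspecting the top words. Because the sampler is random, the groupings can vary with the seed and the number of iterations, especially on a corpus this small.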
