Topic Modeling

In Chapter 6, Clustering – Finding Related Posts, we grouped text documents using clustering. This is a very useful tool, but it is not always the best approach. Clustering results in each text belonging to exactly one cluster. This book is about machine learning and Python. Should it be grouped with other Python-related works or with machine learning-related works? In a physical bookstore, we need to choose a single place to stock the book. In an internet store, however, the answer is that this book is about both machine learning and Python, and the book should be listed in both sections. This does not mean that the book will be listed in all sections, of course. We will not list this book with other baking books.

In this chapter, we will learn methods that do not cluster documents into completely separate groups but that allow each document to refer to several topics. These topics will be identified automatically from a collection of text documents. These documents may be whole books or shorter pieces of text, such as blog posts, news stories, or emails.

We would also like to infer that some of these topics are central to a document, while others are referred to only in passing. This book mentions plotting every so often, but it is not a central topic like machine learning. In other words, documents have topics that are central to them and others that are more peripheral. The subfield of machine learning that deals with these problems is called topic modeling and is the subject of this chapter. In particular, we will learn about the following:

  • What topic models are and, in particular, about latent Dirichlet allocation (LDA)
  • How to use the gensim package to build topic models (a minimal sketch follows this list)
  • How topic models can be useful as an intermediate representation for different applications
  • How we can build a topic model of the whole of the English-language Wikipedia
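To give a flavor of what is to come, here is a minimal sketch of building an LDA model with gensim. The tiny toy corpus and the choice of two topics are illustrative assumptions for this sketch; the chapter works with real datasets and discusses how to choose the number of topics.

    from gensim import corpora, models

    documents = [
        "machine learning with python",
        "python code for plotting data",
        "baking bread and cakes",
        "machine learning models and data",
    ]

    # Tokenize and build a dictionary mapping each word to an integer id
    texts = [doc.split() for doc in documents]
    dictionary = corpora.Dictionary(texts)

    # Convert each document to a bag-of-words: a list of (word_id, count) pairs
    corpus = [dictionary.doc2bow(text) for text in texts]

    # Fit an LDA model with two topics (an assumption for this toy example)
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)

    # Unlike clustering, each document is a mixture of topics
    for doc_bow in corpus:
        print(lda[doc_bow])  # list of (topic_id, weight) pairs

Note how the output assigns each document a weighted mixture of topics rather than forcing it into a single group, which is exactly the property we motivated above.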