Building a topic model

Unfortunately, scikit-learn does not implement latent Dirichlet allocation. Therefore, we are going to use the gensim package for Python. Gensim was developed by Radim Řehůřek, who is a machine learning researcher and consultant in the Czech Republic.

As input data, we are going to use a collection of news reports from the Associated Press (AP). This is a standard dataset for text modeling research, which was used in some of the initial work on topic models. After downloading the data, we can load it by running the following code:

import gensim
from gensim import corpora, models

# Load the AP corpus, which is stored in Blei's LDA-C format: ap.dat holds
# the documents as bags of words, vocab.txt maps word IDs to words
corpus = corpora.BleiCorpus('./data/ap/ap.dat', './data/ap/vocab.txt')

The corpus variable holds all of the text documents and has loaded them in a format that makes for easy processing. We can now build a topic model, using this object as input:

model = models.ldamodel.LdaModel( 
              corpus, 
              num_topics=100, 
              id2word=corpus.id2word) 

This single constructor call will statistically infer which topics are present in corpus. We can explore the resulting model in many ways. We can see the list of topics a document refers to using the model[doc] syntax, as shown in the following example:

doc = corpus.docbyoffset(0) 
topics = model[doc] 
print(topics) 
[(3, 0.023607255776894751), 
 (13, 0.11679936618551275), 
 (19, 0.075935855202707139), 
.... 
 (92, 0.10781541687001292)] 

The result will almost surely look different on your computer! The learning algorithm uses some random numbers, and every time you learn a new topic model on the same input data, the result is different. What is important is that some of the qualitative properties of the model will be stable across different runs if your data is well behaved. For example, if you are using the topics to compare documents, as we do here, then the similarities should be robust and change only slightly. On the other hand, the order of the different topics will be completely different.
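
If you need repeatable results, a minimal sketch is to fix the random seed. The random_state parameter below is available in recent gensim versions (treat its availability as an assumption about your installed version); the rest of the call matches the one shown earlier:

# Fixing the seed so that repeated runs produce the same topics
model = models.ldamodel.LdaModel(
              corpus,
              num_topics=100,
              id2word=corpus.id2word,
              random_state=42)  # fixed seed so repeated runs agree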

The format of the result is a list of pairs: (topic_index, topic_weight). We can see that only a few topics are used for each document (in the preceding example, there is no weight for topics 0, 1, and 2; the weight for those topics is 0). The topic model is a sparse model: although there are many possible topics, only a few of them are used for each document. This is not strictly true, as all topics have a non-zero probability in the LDA model, but some of them have such a small probability that we can round it to zero as a good approximation.
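
To make this concrete, we can expand the sparse list of pairs into a dense vector with one entry per topic, using gensim's matutils.sparse2full helper on the topics variable from the example above (a small illustrative sketch, not part of the book's listing):

from gensim import matutils

# Expand the (topic_index, topic_weight) pairs into a dense vector of
# length num_topics; topics that were not listed become 0.0
dense = matutils.sparse2full(topics, model.num_topics)
print(len(dense))            # 100
print((dense > 0.01).sum())  # how many topics carry noticeable weight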

We can explore this further by plotting a histogram of the number of topics that each document refers to:

import matplotlib.pyplot as plt

# Count how many topics each document refers to and plot the distribution
num_topics_used = [len(model[doc]) for doc in corpus]
fig, ax = plt.subplots()
ax.hist(num_topics_used)

You will get the following plot:

Sparsity means that while you may have large matrices and vectors, in principle, most of the values are zero (or so small that we can round them to zero as a good approximation). Therefore, only a few things are relevant at any given time.
Often, problems that seem too big to solve are actually feasible because the data is sparse. For example, even though any web page can link to any other web page, the graph of links is actually very sparse, as each web page links to only a tiny fraction of all other web pages.
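
As a small, purely illustrative check (not part of the book's code), we can measure how sparse the AP data itself is by comparing the number of distinct words in a typical document with the size of the vocabulary:

# Each document is a list of (word_id, count) pairs, so its length is
# the number of distinct words it contains
docs = list(corpus)
vocab_size = len(corpus.id2word)
avg_distinct = sum(len(d) for d in docs) / float(len(docs))
print("fraction of vocabulary used per document: %.4f"
      % (avg_distinct / vocab_size))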

In the preceding plot, we can see that the majority of documents deal with around 10 topics.

To a large extent, this is due to the value of the parameters that were used, namely, the alpha parameter. The exact meaning of alpha is a bit abstract, but bigger values for alpha will result in more topics per document.

Alpha needs to be greater than zero, but it is typically set to a small value, usually less than one. The smaller the value of alpha, the fewer topics each document will be expected to discuss. By default, gensim will set alpha to 1/num_topics, but you can set it explicitly by passing it as an argument to the LdaModel constructor as follows:

model = models.ldamodel.LdaModel( 
              corpus, 
              num_topics=100, 
              id2word=corpus.id2word, 
              alpha=1) 

In this case, this is a larger alpha than the default, which should lead to more topics per document. As we can see in the combined histogram given next, gensim behaves as we expected and assigns more topics to each document:

Now, we can see in the preceding histogram that many documents touch upon 20 to 25 different topics. If you set the value lower, you will observe the opposite (downloading the code from the online repository will allow you to play around with these values).
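
The combined histogram can be produced along the following lines. This is a sketch rather than the book's exact code: it retrains one model per alpha setting (which takes a while), and the variable names are only illustrative:

import matplotlib.pyplot as plt
from gensim import models

# Train one model with the default alpha (1/num_topics) and one with
# alpha=1, both on the corpus loaded earlier
model_default = models.ldamodel.LdaModel(
              corpus, num_topics=100, id2word=corpus.id2word)
model_alpha1 = models.ldamodel.LdaModel(
              corpus, num_topics=100, id2word=corpus.id2word, alpha=1)

# Count how many topics each document touches under each setting
counts_default = [len(model_default[doc]) for doc in corpus]
counts_alpha1 = [len(model_alpha1[doc]) for doc in corpus]

fig, ax = plt.subplots()
ax.hist([counts_default, counts_alpha1],
        label=['alpha=1/num_topics (default)', 'alpha=1'])
ax.set_xlabel('Number of topics per document')
ax.legend()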

What are these topics? Technically, as we discussed earlier, they are multinomial distributions over words, which means that they assign a probability to each word in the vocabulary. Words with a high probability are more associated with that topic than words with a lower probability:

Our brains are not very good at reasoning with probability distributions, but we can readily make sense of a list of words. Therefore, it is typical to summarize topics with a list of the most highly weighted words.
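
For example, the sketch below prints the twelve most heavily weighted words for each of the first ten topics using gensim's show_topic method (in recent gensim versions it returns (word, probability) pairs; older versions return them in the reverse order):

# Print the most probable words for the first ten topics
for tid in range(10):
    words = model.show_topic(tid, topn=12)
    print(tid + 1, ' '.join(word for word, _ in words))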

In the following table, we display the first ten topics:

Topic no.  Topic
1          Dress military soviet president new state capt carlucci states leader stance government
2          Koch Zambia Lusaka oneparty orange Kochs party i government mayor new political
3          Human turkey rights abuses royal Thompson threats new state wrote garden president
4          Bill employees experiments levin taxation federal measure legislation senate president whistle blowers sponsor
5          Ohio July drought Jesus disaster percent Hartford Mississippi crops northern valley Virginia
6          United percent billion year president world years states people i bush news
7          b Hughes affidavit states united ounces squarefoot care delaying charged unrealistic bush
8          yeutter dukakis bush convention farm subsidies Uruguay percent secretary general i told
9          Kashmir government people Srinagar India dumps city two Jammu Kashmir group Mosley Pakistan
10         workers Vietnamese Irish wage immigrants percent bargaining last island police Hutton i

Although daunting at first glance, when reading through the lists of words, we can clearly see that the topics are not just random words; instead, they form logical groups. We can also see that these topics refer to older news items, from when the Soviet Union still existed and Gorbachev was its general secretary. We can also represent the topics as a word cloud, making the more likely words larger. For example, this is the visualization of a topic that deals with the police:
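
One way to build such a word cloud is with the third-party wordcloud package (an assumption on our part; it is not part of gensim). The topic index below is arbitrary; pick whichever topic you want to visualize:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Use the topic's word probabilities as frequencies so that more likely
# words are drawn larger (topic index 0 is only an example)
weights = dict(model.show_topic(0, topn=40))
wc = WordCloud(background_color='white').generate_from_frequencies(weights)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')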

We can also see that some of the words should perhaps be removed, as they are not very informative; they are stop words. When building a topic model, it can be useful to filter out stop words, as otherwise you might end up with a topic consisting entirely of stop words. We may also wish to preprocess the text to stems in order to normalize plurals and verb forms. This process was covered in Chapter 6, Clustering - Finding Related Posts, and you can refer to it for details. If you are interested, you can download the code from the companion website of the book and try all these variations to draw different pictures.
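
As a rough sketch of this kind of preprocessing (the AP data used above ships already tokenized with a fixed vocabulary, so this is only illustrative and assumes you are starting from raw, tokenized documents of your own), we can drop stop words and stem the remaining tokens before building the gensim dictionary and bag-of-words corpus:

from gensim import corpora
from gensim.parsing.porter import PorterStemmer
from gensim.parsing.preprocessing import STOPWORDS

# Toy documents standing in for your own tokenized texts
texts = [['the', 'police', 'officers', 'arrested', 'the', 'suspects'],
         ['several', 'officers', 'report', 'on', 'police', 'work']]

stemmer = PorterStemmer()
# Remove stop words and reduce each remaining word to its stem
processed = [[stemmer.stem(w) for w in doc if w not in STOPWORDS]
             for doc in texts]

dictionary = corpora.Dictionary(processed)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed]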
