Comparing documents by topic

Topics can be useful on their own for building the sort of small word vignettes shown in the previous screenshot. These visualizations can be used to navigate a large collection of documents. For example, a website can display the different topics as different word clouds, allowing a user to click through to the documents. In fact, topic models have been used in just this way to analyze large collections of documents.

However, topics are often just an intermediate tool to another end. Now that we have an estimate for each document of how much of that document comes from each topic, we can compare the documents in topic space. This simply means that instead of comparing word to word, we say that two documents are similar if they talk about the same topics.

This can be very powerful as two text documents that share few words may actually refer to the same topic! They may just refer to it using different constructions (for example, one document may speak of the United Kingdom while the other will use the abbreviation UK).
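As a toy illustration of this point, the sketch below compares two tiny documents first as word-count vectors and then as topic vectors. The vocabulary and the topic weights are made up for the example (they are not the output of a fitted model); the point is only that documents with almost no words in common can still lie close together in topic space.

```python
import numpy as np

# Hypothetical four-word vocabulary: ["united", "kingdom", "uk", "election"]
doc_a_words = np.array([1, 1, 0, 1])  # "united kingdom election"
doc_b_words = np.array([0, 0, 1, 1])  # "uk election"

# Hypothetical topic weights for the same documents (two topics).
# These numbers are invented for illustration, not estimated by a model.
doc_a_topics = np.array([0.9, 0.1])
doc_b_topics = np.array([0.85, 0.15])

# Euclidean distance between the two documents in each representation.
word_distance = np.linalg.norm(doc_a_words - doc_b_words)
topic_distance = np.linalg.norm(doc_a_topics - doc_b_topics)

print(word_distance)   # distance in word space
print(topic_distance)  # distance in topic space (much smaller here)
```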

Topic models are good on their own to build visualizations and explore data. They are also very useful as an intermediate step in many other tasks.

At this point, we can redo the task of finding the most similar post to an input query, by using the topics to define similarity. Whereas earlier we compared two documents by comparing their word vectors directly, we can now compare two documents by comparing their topic vectors.

For this, we are going to project the documents to the topic space. That is, we want to have a vector of topics that summarizes the document. This is another example of dimensionality reduction of the type discussed in Chapter 5, Dimensionality Reduction. Here, we show you how topic models can be used for exactly this purpose; once topics have been computed for each document, we can perform operations on the topic vector and forget about the original words. If the topics are meaningful, they will be potentially more informative than the raw words. Additionally, this may bring computational advantages, as it is much faster to compare vectors of topic weights than vectors that are as large as the input vocabulary (which will contain thousands of terms).
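To make the projection concrete, here is a minimal sketch of turning one document's topics into a fixed-length vector. It assumes the sparse format that gensim returns for a document, a list of (topic_id, weight) pairs in which low-weight topics are omitted; the helper name to_topic_vector and the sample values are hypothetical.

```python
import numpy as np

def to_topic_vector(sparse_topics, num_topics):
    """Expand a list of (topic_id, weight) pairs into a dense vector."""
    dense = np.zeros(num_topics)
    for topic_id, weight in sparse_topics:
        dense[topic_id] = weight
    return dense

# Made-up example: a document that draws on topics 2 and 7 out of 10.
sparse = [(2, 0.6), (7, 0.4)]
vector = to_topic_vector(sparse, num_topics=10)
print(vector)
```

Every document then has a vector of the same length, whatever its original word count, which is what makes direct comparison between documents possible.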

Using gensim, we have seen how to compute the topics for each document in the corpus. We will now compute them for all the documents, store them in a NumPy array, and then compute all pairwise distances:

from gensim import matutils
# corpus2dense returns one column per document; transpose to get one row per document
topics = matutils.corpus2dense(model[corpus], num_terms=model.num_topics).T

Now, topics is a matrix with one row per document and one column per topic. We can use the pdist function in SciPy to compute all pairwise distances between the rows. Several distance metrics are available; by default, pdist uses the Euclidean distance. That is, with a single function call, we compute all the values of sqrt(sum((topics[ti] - topics[tj])**2)):

from scipy.spatial import distance 
distances = distance.squareform(distance.pdist(topics))
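To see what this call produces, the following sketch computes the same kind of square matrix by hand with NumPy broadcasting. It assumes one row per document and uses a small made-up matrix in place of the real topics array; for the Euclidean metric, the result should agree with squareform(pdist(...)) up to floating-point error.

```python
import numpy as np

# Made-up stand-in for the topics array: 3 documents, 2 topics.
toy_topics = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
])

# diffs[i, j] is the vector toy_topics[i] - toy_topics[j].
diffs = toy_topics[:, np.newaxis, :] - toy_topics[np.newaxis, :, :]

# Euclidean distance: square root of the summed squared differences.
toy_distances = np.sqrt((diffs ** 2).sum(axis=-1))
print(toy_distances)
```

The matrix is symmetric with zeros on the diagonal, and documents 0 and 2, which have identical topic vectors, are at distance zero from each other.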

Now, we will employ one last little trick; we will set the diagonal elements of the distance matrix to infinity so that each document's distance to itself is larger than its distance to any other document:

import numpy as np
for ti in range(len(topics)):
    distances[ti, ti] = np.inf

And we are done! For each document, we can easily look up the closest other document (this is a type of nearest-neighbor lookup):

def closest_to(doc_id): 
    return distances[doc_id].argmin()

This would not work if we had not set the diagonal elements to a large value: the function would always return the query document itself, since every document is at distance zero from itself (except in the unusual case where two documents have exactly the same topic distribution, which is very rare unless they are exact duplicates).
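The effect of the diagonal is easy to check on a small, made-up distance matrix. The sketch below is an illustration only: toy_distances and nearest are hypothetical names, and np.fill_diagonal is used as a compact alternative to the explicit loop.

```python
import numpy as np

# Made-up symmetric distance matrix for three documents.
toy_distances = np.array([
    [0.0, 0.3, 0.9],
    [0.3, 0.0, 0.5],
    [0.9, 0.5, 0.0],
])

def nearest(dist, doc_id):
    """Index of the smallest entry in row doc_id."""
    return int(dist[doc_id].argmin())

# With zeros on the diagonal, every document is its own nearest neighbor.
before = [nearest(toy_distances, i) for i in range(3)]

# Setting the diagonal to infinity excludes each document from its own row.
masked = toy_distances.copy()
np.fill_diagonal(masked, np.inf)
after = [nearest(masked, i) for i in range(3)]

print(before)  # [0, 1, 2]
print(after)   # [1, 0, 1]
```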

For example, here is one possible query document (it is the second document in our collection):

From: [email protected] (Gordon Banks)
Subject: Re: request for information on "essential tremor" and Indrol?

In article <[email protected]> [email protected] writes:

Essential tremor is a progressive hereditary tremor that gets worse
when the patient tries to use the effected member.  All limbs, vocal
cords, and head can be involved.  Inderal is a beta-blocker and
is usually effective in diminishing the tremor.  Alcohol and mysoline
are also effective, but alcohol is too toxic to use as a treatment.
--
----------------------------------------------------------------------------
Gordon Banks  N3JXP      | "Skepticism is the chastity of the intellect, and
[email protected]   |  it is shameful to surrender it too soon."
----------------------------------------------------------------------------

If we ask for the most similar document by calling closest_to(1), we receive the following document as a result:

From: [email protected] (Gordon Banks)
Subject: Re: High Prolactin

In article <[email protected]> [email protected] (John E. Rodway) writes:
>Any comments on the use of the drug Parlodel for high prolactin in the blood?
>

It can suppress secretion of prolactin. Is useful in cases of galactorrhea.
Some adenomas of the pituitary secret too much.

--
----------------------------------------------------------------------------
Gordon Banks  N3JXP      | "Skepticism is the chastity of the intellect, and
[email protected]   |  it is shameful to surrender it too soon."

The system returns a post by the same author discussing medication.
