Previously in this chapter, we looked at a number of ways to programmatically see what's present in documents. We saw how to identify people, places, dates, and other things in documents. We saw how to break things up into sentences.
Another, more sophisticated way to discover what's in a document is to use topic modeling. Topic modeling attempts to identify a set of topics that are contained in the document collection. Each topic is a cluster of words that are used together throughout the corpus. These clusters are found in individual documents to varying degrees, and a document is composed of several topics to varying extents. We'll take a look at this in more detail in the explanation for this recipe.
To perform topic modeling, we'll use MALLET (http://mallet.cs.umass.edu/). This is a library and utility that implements topic modeling in addition to several other document classification algorithms.
For this recipe, we'll need these lines in our project.clj file:
(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cc.mallet/mallet "2.0.7"]])
Our imports and requirements for this are pretty extensive too, as shown here:
(require '[clojure.java.io :as io])
(import [cc.mallet.types InstanceList]
        [cc.mallet.pipe Input2CharSequence
                        CharSequence2TokenSequence
                        TokenSequenceLowercase
                        TokenSequenceRemoveStopwords
                        TokenSequence2FeatureSequence
                        SerialPipes]
        [cc.mallet.pipe.iterator FileListIterator]
        [cc.mallet.topics ParallelTopicModel]
        [java.io FileFilter])
Again, we'll use the State of the Union addresses that we've already seen several times in this chapter. You can download these from http://www.ericrochester.com/clj-data-analysis/data/sotu.tar.gz. I've unpacked the data from this file into the sotu directory.
We'll need to work the documents through several phases to perform topic modeling, as follows:

1. First, we'll create the InstanceList, along with the series of processing pipes that each document will pass through:

(defn make-pipe-list []
  (InstanceList.
    (SerialPipes.
      [(Input2CharSequence. "UTF-8")
       (CharSequence2TokenSequence. #"\p{L}[\p{L}\p{P}]+\p{L}")
       (TokenSequenceLowercase.)
       (TokenSequenceRemoveStopwords. false false)
       (TokenSequence2FeatureSequence.)])))

2. This function takes a directory and adds its files to the InstanceList, which is a collection of documents along with their metadata:

(defn add-directory-files [instance-list corpus-dir]
  (.addThruPipe
    instance-list
    (FileListIterator.
      (.listFiles (io/file corpus-dir))
      (reify FileFilter
        (accept [this pathname] true))
      #"/([^/]*).txt$"
      true)))

3. Finally, this function takes the InstanceList and some other parameters and trains a topic model, which it returns:

(defn train-model
  ([instances] (train-model 100 4 50 instances))
  ([num-topics num-threads num-iterations instances]
   (doto (ParallelTopicModel. num-topics 1.0 0.01)
     (.addInstances instances)
     (.setNumThreads num-threads)
     (.setNumIterations num-iterations)
     (.estimate))))
Now, we can take these three functions and use them to train a topic model. While training, it will output some information about the process, and finally, it will list the top terms for each topic:

user=> (def pipe-list (make-pipe-list))
user=> (add-directory-files pipe-list "sotu/")
user=> (def tm (train-model 10 4 50 pipe-list))
…
INFO: 0 0.1 government federal year national congress war
      1 0.1 world nation great power nations people
      2 0.1 world security years programs congress program
      3 0.1 law business men work people good
      4 0.1 america people americans american work year
      5 0.1 states government congress public people united
      6 0.1 states public made commerce present session
      7 0.1 government year department made service legislation
      8 0.1 united states congress act government war
      9 0.1 war peace nation great men people
It's difficult to succinctly and clearly explain how topic modeling works. Conceptually, it assigns words from the documents to buckets (topics). This is done in such a way that randomly drawing words from the buckets will most probably recreate the documents.
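To make that intuition concrete, here is a small, self-contained sketch in plain Clojure (no MALLET) of the generative story: each topic is a probability distribution over words, each document has a mixture of topics, and a document is "written" by repeatedly picking a topic from the mixture and then a word from that topic's bucket. The topics and mixture here are invented purely for illustration:

```clojure
;; Each topic is a "bucket": a map from word to probability.
(def topics
  {:war  {"war" 0.5, "peace" 0.3, "nation" 0.2}
   :econ {"business" 0.4, "work" 0.4, "law" 0.2}})

;; This document's topic mixture: mostly :war, some :econ.
(def doc-mixture {:war 0.7, :econ 0.3})

(defn sample
  "Draw one key at random from a map of key->probability."
  [dist]
  (let [r (rand)]
    (loop [[[k p] & more] (seq dist)
           acc 0.0]
      (let [acc (+ acc p)]
        (if (or (nil? more) (< r acc))
          k
          (recur more acc))))))

(defn generate-doc
  "Generate n words by sampling a topic from the mixture,
  then a word from that topic's bucket."
  [mixture topics n]
  (repeatedly n #(sample (get topics (sample mixture)))))

(generate-doc doc-mixture topics 10)
;; e.g. ("war" "peace" "work" "war" "nation" ...)
```

Topic modeling runs this story in reverse: given only the documents, it infers the buckets and the per-document mixtures that would most probably have generated them.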
Interpreting the topics is always interesting. Generally, it involves taking a look at the top words for each topic and cross-referencing them with the documents that scored most highly for this topic.
For example, take the fourth topic, whose top words are law, business, men, and work. The top-scoring document for this topic was the 1908 SOTU, with a distribution of 0.378. It was given by Theodore Roosevelt, and in it he talked a lot about labor issues and legislation to rein in corrupt corporations. All of the top words were used heavily, but exactly what the topic is about isn't evident without actually taking a look at the document itself.
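If you want to do this cross-referencing programmatically, MALLET's ParallelTopicModel can report each document's topic distribution via getTopicProbabilities. The sketch below assumes the tm and pipe-list values defined earlier in this recipe and that the corpus has been loaded; treat it as a starting point rather than a finished utility:

```clojure
(defn doc-topic
  "Return [document-name probability] pairs for one topic,
  sorted so the highest-scoring documents come first.
  Assumes model is a trained ParallelTopicModel and
  instance-list is the InstanceList it was trained on."
  [model instance-list topic-n]
  (->> (range (count instance-list))
       (map (fn [i]
              [(str (.getName (.get instance-list i)))
               (aget (.getTopicProbabilities model i) topic-n)]))
       (sort-by second >)))

;; The five documents that score highest for the fourth topic
;; (law, business, men, work):
(take 5 (doc-topic tm pipe-list 3))
```

From there, you can open the top documents and check whether the topic's words are really being used the way the topic suggests.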
There are a number of good papers and tutorials on topic modeling. There's a good tutorial written by Shawn Graham, Scott Weingart, and Ian Milligan at http://programminghistorian.org/lessons/topic-modeling-and-mallet.
For a more rigorous explanation, check out Mark Steyvers's introduction, Probabilistic Topic Models, which you can see at http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf.
For some information on how to evaluate the topics that you get, see http://homepages.inf.ed.ac.uk/imurray2/pub/09etm.