We can use Latent Dirichlet Allocation (LDA) to cluster a given set of words into topics and to map a set of documents onto combinations of those topics. LDA is useful for identifying the meaning of a document or a word based on its context, rather than relying solely on word counts or exact word matches. Hence, LDA can be used to identify user intent and to resolve ambiguous words in systems such as search engines. Other example use cases of LDA include identifying influential Twitter users for particular topics; the Twahpic (http://twahpic.cloudapp.net) application, for instance, uses LDA to identify the topics discussed on Twitter.
LDA uses the TF vector space model instead of the TF-IDF model, as it needs to consider the co-occurrence and correlation of words.
Install and deploy Hadoop MapReduce and HDFS. Export the HADOOP_HOME
environment variable to point to your Hadoop installation root folder.
Download and install Apache Mahout. Export the MAHOUT_HOME
environment variable to point to your Mahout installation root folder. Refer to the Installing Mahout recipe of Chapter 5, Hadoop Ecosystem, for more information on installing Mahout.
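As a minimal sketch of the setup above, assuming Hadoop and Mahout are extracted under /opt (hypothetical paths; substitute your actual installation folders), the environment variables can be exported as follows:

```shell
# Hypothetical installation roots; replace with your own paths.
export HADOOP_HOME=/opt/hadoop
export MAHOUT_HOME=/opt/mahout

# Optionally add the bin directories to the PATH so the hadoop and
# mahout commands can be invoked without their full paths.
export PATH="$HADOOP_HOME/bin:$MAHOUT_HOME/bin:$PATH"
```

Adding these lines to your shell profile (for example, ~/.bashrc) makes the settings persist across sessions.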
The following steps show you how to run the Mahout LDA algorithm on a subset of the 20news dataset:
1. Download and extract the 20news dataset from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz.
2. Upload the extracted data to HDFS:
>bin/hadoop fs -mkdir 20news-all
>bin/hadoop fs -put <extracted_folder> 20news-all
3. Go to MAHOUT_HOME. Generate sequence files from the uploaded text data:
>bin/mahout seqdirectory -i 20news-all -o 20news-seq
4. Generate TF vectors from the sequence files:
>bin/mahout seq2sparse -i 20news-seq -o 20news-tf -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer
5. Convert the TF vectors from SequenceFile<Text, VectorWritable> to SequenceFile<IntWritable, Text>:
>bin/mahout rowid -i 20news-tf/tf-vectors -o 20news-tf-int
6. Perform the LDA computation using the cvb command:
>bin/mahout cvb -i 20news-tf-int/matrix -o lda-out -k 10 -x 20 -dict 20news-tf/dictionary.file-0 -dt lda-topics -mt lda-topic-model
7. Dump and inspect the results of the LDA computation:
>bin/mahout seqdumper -i lda-topics/part-m-00000
Input Path: lda-topics5/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable
Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {0:0.12492744375758073,1:0.03875953927132082,2:0.1228639250669511,3:0.15074522974495433,4:0.10512715697420276,5:0.10130565323653766,6:0.061169131590630275,7:0.14501579630233746,8:0.07872957132697946,9:0.07135655272850545}
.....
8. Join the output vectors with the dictionary mapping of terms to term indexes:
>bin/mahout vectordump -i lda-topics/part-m-00000 --dictionary 20news-tf/dictionary.file-0 --vectorSize 10 -dt sequencefile
......
{"Fluxgate:0.12492744375758073,&:0.03875953927132082,(140.220.1.1):0.1228639250669511,(Babak:0.15074522974495433,(Bill:0.10512715697420276,(Gerrit:0.10130565323653766,(Michael:0.061169131590630275,(Scott:0.14501579630233746,(Usenet:0.07872957132697946,(continued):0.07135655272850545}
{"Fluxgate:0.13130952097888746,&:0.05207587369196414,(140.220.1.1):0.12533225607394424,(Babak:0.08607740024552457,(Bill:0.20218284543514245,(Gerrit:0.07318295757631627,(Michael:0.08766888242201039,(Scott:0.08858421220476514,(Usenet:0.09201906604666685,(continued):0.06156698532477829}
.......
The Mahout cvb version of LDA implements the Collapsed Variational Bayes (CVB) inference algorithm using an iterative MapReduce approach:
>bin/mahout cvb -i 20news-tf-int/matrix -o lda-out -k 10 -x 20 -dict 20news-tf/dictionary.file-0 -dt lda-topics -mt lda-topic-model
The -i parameter provides the input path, while the -o parameter provides the path to store the output. -k specifies the number of topics to learn, and -x specifies the maximum number of iterations for the computation. -dict points to the dictionary containing the mapping of terms to term indexes. The path given in the -dt parameter stores the training topic distributions. The path given in -mt is used as a temporary location to store the intermediate models.
All the command-line options of the cvb
command can be queried by invoking the help
option as follows:
> bin/mahout cvb --help
Setting the number of topics to a very small value brings out very high-level topics. A large number of topics produces more descriptive topics, but takes longer to process. The maxDFPercent option can be used to remove common words, thereby speeding up the processing.
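As a sketch of the tip above, the maxDFPercent option is passed to the seq2sparse step rather than to cvb. The 85 percent cutoff below is an assumed example value, not one prescribed by this recipe; tune it for your data:

```shell
# Regenerate the TF vectors while ignoring terms that appear in more
# than 85% of the documents (assumed example threshold). Such very
# common words carry little topical signal and slow down the LDA run.
bin/mahout seq2sparse -i 20news-seq -o 20news-tf \
  -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer \
  --maxDFPercent 85
```

The subsequent rowid and cvb steps can then be rerun unchanged on the regenerated 20news-tf output.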