Topic discovery using Latent Dirichlet Allocation (LDA)

We can use Latent Dirichlet Allocation (LDA) to cluster a given set of words into topics and to map a set of documents onto combinations of those topics. LDA is useful for identifying the meaning of a document or a word based on its context, rather than relying solely on word counts or exact word matches. It can be used to identify intent and to resolve ambiguous words in systems such as search engines. Other example use cases of LDA include identifying influential Twitter users for particular topics; the Twahpic application (http://twahpic.cloudapp.net) uses LDA to identify the topics discussed on Twitter.

LDA uses the TF vector space model instead of the TF-IDF model, as it needs to consider the co-occurrence and correlation of words.
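
As a minimal illustration of the TF representation, the following is a plain Python sketch (not Mahout code; the vocabulary and document are made up). Raw counts are kept as-is, rather than being downweighted by inverse document frequency:

```python
from collections import Counter

def tf_vector(document, vocabulary):
    """Raw term-frequency vector: one count per vocabulary term."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["hadoop", "mapreduce", "lda", "topic"]
doc = "hadoop mapreduce hadoop lda"
print(tf_vector(doc, vocab))  # [2, 1, 1, 0]
```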

Getting ready

Install and deploy Hadoop MapReduce and HDFS. Export the HADOOP_HOME environment variable to point to your Hadoop installation root folder.

Download and install Apache Mahout. Export the MAHOUT_HOME environment variable to point to your Mahout installation root folder. Refer to the Installing Mahout recipe of Chapter 5, Hadoop Ecosystem, for more information on installing Mahout.

How to do it…

The following steps show you how to run the Mahout LDA algorithm on a subset of the 20news dataset:

  1. Download and extract the 20news dataset from http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz.
  2. Upload the extracted data to the HDFS:
    >bin/hadoop fs -mkdir 20news-all
    >bin/hadoop fs -put <extracted_folder> 20news-all
    
  3. Go to the MAHOUT_HOME directory. Generate sequence files from the uploaded text data:
    >bin/mahout seqdirectory -i 20news-all -o 20news-seq
    
  4. Generate sparse vectors from the text data in the sequence files:
    > bin/mahout seq2sparse 
       -i 20news-seq -o 20news-tf 
       -wt tf 
       -a org.apache.lucene.analysis.WhitespaceAnalyzer
    
  5. Convert the TF vectors from SequenceFile<Text, VectorWritable> to SequenceFile<IntWritable, VectorWritable>:
    >bin/mahout rowid -i 20news-tf/tf-vectors -o 20news-tf-int
    
  6. Run the following command to perform the LDA computation:
    > bin/mahout cvb 
       -i 20news-tf-int/matrix -o lda-out 
       -k 10 -x 20 
       -dict 20news-tf/dictionary.file-0 
       -dt lda-topics 
       -mt lda-topic-model
    
  7. Dump and inspect the results of the LDA computation:
    >bin/mahout seqdumper -i lda-topics/part-m-00000
    Input Path: lda-topics/part-m-00000
    Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
    Key: 0: Value: {0:0.12492744375758073,1:0.03875953927132082,2:0.1228639250669511,3:0.15074522974495433,4:0.10512715697420276,5:0.10130565323653766,6:0.061169131590630275,7:0.14501579630233746,8:0.07872957132697946,9:0.07135655272850545}
    .....
    
  8. Join the output vectors with the dictionary mapping of term to term indexes:
    >bin/mahout vectordump -i lda-topics/part-m-00000 --dictionary 20news-tf/dictionary.file-0 --vectorSize 10 -dt sequencefile
    ......
    
    {"Fluxgate:0.12492744375758073,&:0.03875953927132082,(140.220.1.1):0.1228639250669511,(Babak:0.15074522974495433,(Bill:0.10512715697420276,(Gerrit:0.10130565323653766,(Michael:0.061169131590630275,(Scott:0.14501579630233746,(Usenet:0.07872957132697946,(continued):0.07135655272850545}
    {"Fluxgate:0.13130952097888746,&:0.05207587369196414,(140.220.1.1):0.12533225607394424,(Babak:0.08607740024552457,(Bill:0.20218284543514245,(Gerrit:0.07318295757631627,(Michael:0.08766888242201039,(Scott:0.08858421220476514,(Usenet:0.09201906604666685,(continued):0.06156698532477829}
    .......
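
Conceptually, the join performed by vectordump in step 8 maps each term index in a topic vector back to its dictionary term. The following is a small in-memory Python sketch of that idea (the dictionary entries and weights here are made-up values, not output from the commands above):

```python
# Hypothetical dictionary (term index -> term) and topic-term weights.
index_to_term = {0: "fluxgate", 1: "usenet", 2: "hadoop"}
topic_vector = {0: 0.12, 1: 0.03, 2: 0.85}

# Replace indexes with terms, then rank terms by weight within the topic.
named = {index_to_term[i]: w for i, w in topic_vector.items()}
top_terms = sorted(named.items(), key=lambda kv: kv[1], reverse=True)
print(top_terms[0])  # ('hadoop', 0.85)
```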
    

How it works…

The Mahout cvb version of LDA implements the Collapsed Variational Bayesian inference algorithm using an iterative MapReduce approach:

>bin/mahout cvb -i 20news-tf-int/matrix -o lda-out -k 10  -x 20  -dict 20news-tf/dictionary.file-0 -dt lda-topics -mt lda-topic-model

The -i parameter provides the input path, while the -o parameter provides the path to store the output. -k specifies the number of topics to learn, and -x specifies the maximum number of iterations for the computation. -dict points to the dictionary containing the mapping of terms to term indexes. The path given in the -dt parameter stores the training topic distribution. The path given in -mt is used as a temporary location to store the intermediate models.
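
The per-document vectors dumped in step 7 are topic proportions. A short Python sketch (reusing the step 7 values for document 0, rounded to four decimal places) shows one way to pick the dominant topic for a document:

```python
# Topic proportions for document 0, taken from the seqdumper output in step 7.
doc_topics = {
    0: [0.1249, 0.0388, 0.1229, 0.1507, 0.1051,
        0.1013, 0.0612, 0.1450, 0.0787, 0.0714],
}

for doc_id, dist in doc_topics.items():
    # The dominant topic is the index with the largest proportion.
    dominant = max(range(len(dist)), key=lambda t: dist[t])
    print(doc_id, dominant, dist[dominant])  # → 0 3 0.1507
```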

All the command-line options of the cvb command can be queried by invoking the help option as follows:

> bin/mahout  cvb  --help

Setting the number of topics to a very small value brings out very high-level topics. A large number of topics gives more descriptive topics, but takes longer to process. The maxDFPercent option of the seq2sparse command can be used to remove common words, thereby speeding up the processing.
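
The effect of such a document-frequency cutoff can be sketched in a few lines of Python (the mini-corpus and threshold here are invented for illustration; this mimics the spirit of maxDFPercent, not Mahout's implementation):

```python
# Invented mini-corpus: each document is the set of terms it contains.
docs = [
    {"the", "hadoop", "cluster"},
    {"the", "mahout", "lda"},
    {"the", "topic", "model"},
]
max_df_percent = 70  # drop terms appearing in more than 70% of documents

# Count document frequency (DF): how many documents contain each term.
df = {}
for doc in docs:
    for term in doc:
        df[term] = df.get(term, 0) + 1

kept = {t for t in df if df[t] / len(docs) * 100 <= max_df_percent}
print(sorted(kept))  # ['cluster', 'hadoop', 'lda', 'mahout', 'model', 'topic']
```

Here "the" appears in 100% of the documents, exceeds the cutoff, and is removed before topic modeling.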

See also
