Implementation

The following steps show the TM pipeline, from reading the data to printing the topics along with their term weights. Here is the short workflow of the TM pipeline:

object topicmodelingwithLDA {
  def main(args: Array[String]): Unit = {
    // The actual computations are done in the LDAforTM class
    val lda = new LDAforTM()
    // Load the default parameters for training
    val defaultParams = Params().copy(input = "data/docs/")
    // Train the LDA model with the default parameters
    lda.run(defaultParams)
  }
}

We also need to import some related packages and libraries:

import edu.stanford.nlp.process.Morphology
import edu.stanford.nlp.simple.Document
import org.apache.log4j.{Level, Logger}
import scala.collection.JavaConversions._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, EMLDAOptimizer, LDA, OnlineLDAOptimizer, LDAModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

The actual TM computation is done in the LDAforTM class. Params is a case class used to load the parameters for training the LDA model. Finally, we train the LDA model using the parameter settings supplied via the Params class. We will now explain each step broadly, with step-by-step source code.
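
Before going through the details, here is a minimal sketch of how these two pieces fit together. The field names, default values, and method body below are assumptions for illustration only, not the book's exact definitions:

// Hypothetical parameter holder; the real Params class may carry more fields.
case class Params(
  input: String = "",          // path to the input corpus
  k: Int = 5,                  // number of topics to extract
  maxIterations: Int = 100,    // number of LDA training iterations
  vocabSize: Int = 10000,      // maximum vocabulary size
  algorithm: String = "em"     // LDA optimizer: "em" or "online"
)

// Skeleton of the training class; the actual implementation follows in the next steps.
class LDAforTM {
  def run(params: Params): Unit = {
    // 1. Read and tokenize the documents under params.input
    // 2. Build the term-count vectors that serve as LDA input
    // 3. Train the LDA model with params.k topics
    // 4. Print each topic with its top terms and their weights
  }
}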
