Implementation

The following steps show the TM pipeline, from reading the data to printing the topics along with their term weights. Here is the short workflow of the TM pipeline:

object topicmodelingwithLDA {
  def main(args: Array[String]): Unit = {
    // The actual computations are done in the LDAforTM class
    val lda = new LDAforTM()
    // Load the default parameters for training
    val defaultParams = Params().copy(input = "data/docs/")
    // Train the LDA model with the default parameters
    lda.run(defaultParams)
  }
}

We also need to import some related packages and libraries:

import edu.stanford.nlp.process.Morphology
import edu.stanford.nlp.simple.Document
import org.apache.log4j.{Level, Logger}
import scala.collection.JavaConversions._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, EMLDAOptimizer, LDA, OnlineLDAOptimizer, LDAModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

The actual TM computation is done in the LDAforTM class. Params is a case class used to load the parameters for training the LDA model. Finally, we train the LDA model using the parameter settings supplied via the Params class. We will now explain each step broadly, with step-by-step source code.
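
Before going through the details, here is a minimal sketch of how these two pieces fit together. The field names, default values, and method body below are assumptions for illustration only, not the book's exact definitions:

// Hypothetical parameter holder; the real Params class may carry more fields.
case class Params(
  input: String = "",          // path to the input corpus
  k: Int = 5,                  // number of topics to extract
  maxIterations: Int = 100,    // number of LDA training iterations
  vocabSize: Int = 10000,      // maximum vocabulary size
  algorithm: String = "em"     // LDA optimizer: "em" or "online"
)

// Skeleton of the training class; the actual implementation follows in the next steps.
class LDAforTM {
  def run(params: Params): Unit = {
    // 1. Read and tokenize the documents under params.input
    // 2. Build the term-count vectors that serve as LDA input
    // 3. Train the LDA model with params.k topics
    // 4. Print each topic with its top terms and their weights
  }
}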
