Deploying the trained LDA model

For this mini deployment, let's use a real-life dataset: PubMed. A sample dataset containing PubMed terms can be downloaded from https://nlp.stanford.edu/software/tmt/tmt-0.4/examples/pubmed-oa-subset.csv. The link points to a dataset in CSV format, but it downloads under the unusual name 4UK1UkTX.csv.
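If you want a quick look at the file before feeding it to the pipeline, here is a minimal sketch (not part of the original workflow). It assumes an active SparkSession named spark, like the one created in the complete program at the end of this section, and the file saved locally as data/4UK1UkTX.csv:

// Peek at the downloaded CSV; if the file has no header row,
// drop the "header" option accordingly
val pubmed = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/4UK1UkTX.csv")
pubmed.show(5, truncate = false)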

To be more specific, the dataset contains the abstracts of some biological articles, their publication year, and their serial numbers. A glimpse is given in the following figure:

Figure 6: A snapshot of the sample dataset

We have already saved the trained LDA model for future use, as follows:

params.ldaModel.save(spark.sparkContext, "model/LDATrainedModel")
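One caveat worth noting: MLlib's save() will fail if the target directory already exists, for example on a re-run. The following is a minimal sketch, not from the original pipeline, that guards the save so a repeated run does not crash:

import scala.util.{Failure, Success, Try}

// Guarded save: save() throws if model/LDATrainedModel is already present
Try(params.ldaModel.save(spark.sparkContext, "model/LDATrainedModel")) match {
  case Success(_) => println("Model saved to model/LDATrainedModel")
  case Failure(e) => println(s"Model could not be saved: ${e.getMessage}")
}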

The trained model will be saved to the previously mentioned location. The directory will include data and metadata about the model and the training itself, as shown in the following figure:

Figure 7: The directory structure of the trained and saved LDA model

As expected, the data folder holds some Parquet files containing the global topics and their counts, the tokens and their counts, and the topics with their respective counts.
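The metadata folder, by contrast, is written as plain JSON text, so it can be inspected directly. Here is a minimal sketch, not part of the original workflow, assuming the SparkSession named spark is still active:

// Print the saved model's JSON metadata
// (model class name, format version, and training parameters)
spark.sparkContext
  .textFile("model/LDATrainedModel/metadata")
  .collect()
  .foreach(println)

Now, the next task is to restore the same model, as follows: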

//Restoring the model for reuse
val savedLDAModel = DistributedLDAModel.load(spark.sparkContext, "model/LDATrainedModel/")

//Then we execute the following workflow:
val lda = new LDAforTM() // actual computations are done here

// Loading the parameters: the input dataset and the restored model
val defaultParams = Params().copy(input = "data/4UK1UkTX.csv", ldaModel = savedLDAModel)

// Training the LDA model with the default parameters
lda.run(defaultParams)
spark.stop()
>>>
Training corpus summary:
-------------------------------
Training set size: 1 documents
Vocabulary size: 14670 terms
Number of tokens: 14670 tokens
Preprocessing time: 12.921435786 sec
-------------------------------
Finished training LDA model.
Summary:
Training time: 23.243336895 sec
The average log likelihood of the training data: -1008739.37857908
5 topics:
TOPIC 0
------------------------------
rrb 0.015234818404037585
lrb 0.015154125349208018
sequence 0.008924621534990771
gene 0.007391453509409655
cell 0.007020265462594214
protein 0.006479622004524878
study 0.004954523307983932
show 0.0040023453035193685
site 0.0038006126784248945
result 0.0036634344941610534
----------------------------
weight: 0.07662582204885438
TOPIC 1
------------------------------
rrb 1.745030693927338E-4
lrb 1.7450110447001028E-4
sequence 1.7424254444446083E-4
gene 1.7411236867642102E-4
cell 1.7407234230511066E-4
protein 1.7400587965300172E-4
study 1.737407317498879E-4
show 1.7347354627656383E-4
site 1.7339989737227756E-4
result 1.7334522348574853E-4
---------------------------
weight: 0.07836521875668061
TOPIC 2
------------------------------
rrb 1.745030693927338E-4
lrb 1.7450110447001028E-4
sequence 1.7424254444446083E-4
gene 1.7411236867642102E-4
cell 1.7407234230511066E-4
protein 1.7400587965300172E-4
study 1.737407317498879E-4
show 1.7347354627656383E-4
site 1.7339989737227756E-4
result 1.7334522348574853E-4
----------------------------
weight: 0.08010461546450684
TOPIC 3
------------------------------
rrb 1.745030693927338E-4
lrb 1.7450110447001028E-4
sequence 1.7424254444446083E-4
gene 1.7411236867642102E-4
cell 1.7407234230511066E-4
protein 1.7400587965300172E-4
study 1.737407317498879E-4
show 1.7347354627656383E-4
site 1.7339989737227756E-4
result 1.7334522348574853E-4
----------------------------
weight: 0.08184401217233307
TOPIC 4
------------------------------
rrb 1.745030693927338E-4
lrb 1.7450110447001028E-4
sequence 1.7424254444446083E-4
gene 1.7411236867642102E-4
cell 1.7407234230511066E-4
protein 1.7400587965300172E-4
study 1.737407317498879E-4
show 1.7347354627656383E-4
site 1.7339989737227756E-4
result 1.7334522348574853E-4
----------------------------
weight: 0.0835834088801593

Well done! We have managed to reuse the model and make the same kind of prediction. However, probably owing to randomness in training, we observed slightly different predictions.
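Before walking through the complete program, note that the restored model can also infer topic mixtures for documents it has never seen. The following is a minimal sketch, not part of the original workflow: it assumes savedLDAModel and the SparkSession named spark are in scope, and the sparse term-count vector is purely illustrative.

import org.apache.spark.mllib.linalg.Vectors

// Score an unseen document with the restored model.
// DistributedLDAModel only stores topic mixtures for its training documents,
// so we first convert it to a LocalLDAModel.
val localModel = savedLDAModel.toLocal
// A toy term-count vector over the model's vocabulary (indices and counts are arbitrary)
val newDocs = spark.sparkContext.parallelize(Seq(
  (0L, Vectors.sparse(localModel.vocabSize, Seq((0, 2.0), (5, 1.0), (42, 3.0))))))
localModel.topicDistributions(newDocs)
  .collect()
  .foreach { case (docId, dist) => println(s"Doc $docId -> $dist") }

Now, let's see the complete code to get a clearer view: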

package com.packt.ScalaML.Topicmodeling

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.clustering.DistributedLDAModel

object LDAModelReuse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "data/")
      .appName("LDA_TopicModelling")
      .getOrCreate()

    // Restoring the model for reuse
    val savedLDAModel = DistributedLDAModel.load(spark.sparkContext, "model/LDATrainedModel/")

    val lda = new LDAforTM() // actual computations are done here

    // Loading the parameters: the input dataset and the restored model
    val defaultParams = Params().copy(input = "data/4UK1UkTX.csv", ldaModel = savedLDAModel)

    // Training the LDA model with the default parameters
    lda.run(defaultParams)
    spark.stop()
  }
}
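Finally, the per-topic term listing shown earlier can be reproduced from the restored model with describeTopics. The following is a minimal sketch; vocabArray (an Array[String] mapping term indices back to terms, built during preprocessing) is an assumption here, not something the program above defines.

// Print the top 10 weighted terms per topic, mirroring the output shown earlier.
// `vocabArray: Array[String]` is assumed to come from the preprocessing stage.
val topics = savedLDAModel.describeTopics(maxTermsPerTopic = 10)
topics.zipWithIndex.foreach { case ((termIndices, termWeights), topicId) =>
  println(s"TOPIC $topicId")
  termIndices.zip(termWeights).foreach { case (termIdx, weight) =>
    println(s"${vocabArray(termIdx)} $weight")
  }
}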