Now, to get some more statistics, such as the log likelihood of the training data, we can use the following code:
if (ldaModel.isInstanceOf[DistributedLDAModel]) {
  val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
  val avgLogLikelihood = distLDAModel.logLikelihood / actualCorpusSize.toDouble
  println("The average log likelihood of the training data: " + avgLogLikelihood)
  println()
}
The preceding code casts the LDA model to its distributed version, DistributedLDAModel, and then calculates the average log likelihood of the training data:
The average log likelihood of the training data: -209692.79314860413
For more information on the likelihood measurement, interested readers should refer to https://en.wikipedia.org/wiki/Likelihood_function.
Now imagine that we have computed the preceding metric for documents X and Y. Then we can answer the following question:
- How similar are documents X and Y?
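One way to make this comparison concrete, assuming the model was trained with the EM optimizer (so that it is a DistributedLDAModel), is to compare the per-document topic distributions exposed by topicDistributions. The following sketch uses cosine similarity as an illustrative measure; docIdX and docIdY are hypothetical document IDs, not values from the original example:

```scala
import org.apache.spark.mllib.linalg.Vector

// Hypothetical IDs of documents X and Y in the training corpus.
val docIdX = 0L
val docIdY = 1L

// Collect the per-document topic distributions: Map[docId, topic-weight vector].
// Note: collectAsMap() brings all distributions to the driver, so this sketch
// is only suitable for small corpora.
val topicDist = distLDAModel.topicDistributions.collectAsMap()

// Cosine similarity between two topic-weight vectors.
def cosineSimilarity(a: Vector, b: Vector): Double = {
  val (aArr, bArr) = (a.toArray, b.toArray)
  val dot = aArr.zip(bArr).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(aArr.map(x => x * x).sum)
  val normB = math.sqrt(bArr.map(x => x * x).sum)
  dot / (normA * normB)
}

val similarity = cosineSimilarity(topicDist(docIdX), topicDist(docIdY))
println(s"Topic-distribution similarity between X and Y: $similarity")
```

A value close to 1.0 means the two documents have very similar topic mixtures; a value close to 0.0 means they are about largely different topics.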
The idea is to take the lowest likelihood across all the training documents and use it as a threshold for the preceding comparison. Finally, to answer the third and final question:
- If I am interested in topic Z, which documents should I read first?
A minimal answer: by taking a close look at the per-document topic distributions and the relative term weights, we can decide which documents to read first.
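This can be sketched in code as well. Assuming a DistributedLDAModel as before, the following ranks the training documents by the weight they assign to a topic of interest; topicZ is a hypothetical topic index chosen for illustration:

```scala
// Hypothetical index of topic Z we are interested in.
val topicZ = 0

// Rank documents by the weight of topic Z in their topic distributions
// and take the top 10 as reading suggestions.
val topDocsForTopicZ = distLDAModel.topicDistributions
  .map { case (docId, dist) => (docId, dist(topicZ)) }
  .sortBy(-_._2)
  .take(10)

topDocsForTopicZ.foreach { case (docId, weight) =>
  println(s"Document $docId: weight of topic $topicZ = $weight")
}
```

The documents printed first devote the largest share of their topic mixture to topic Z, so they are natural candidates to read first.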