Question
I'm working with Spark MLlib, and right now I'm doing something with LDA.
But when I use the code provided by Spark (see below) to predict a document that was used to train the model, the predicted document-topic distribution is completely different from the document-topic distribution produced during training.
I don't know what is causing this.
Asking for help; here is my code below:
train: lda.run(corpus)
The corpus is an RDD of type RDD[(Long, Vector)], where each Vector is a term-count vector over the vocabulary (word indices and their counts).
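For illustration only, a minimal sketch of what one element of such a corpus could look like (the vocabulary size of 4 and the word indices/counts here are made-up values):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// One (documentId, termCountVector) pair over an assumed vocabulary of size 4:
// word index 0 occurs 3 times and word index 2 occurs once in document 0.
val doc: (Long, Vector) = (0L, Vectors.sparse(4, Array(0, 2), Array(3.0, 1.0)))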
predict:
def predict(documents: RDD[(Long, Vector)], ldaModel: LDAModel): Array[(Long, Vector)] = {
  var docTopicsWeight = new Array[(Long, Vector)](documents.collect().length)
  ldaModel match {
    case localModel: LocalLDAModel =>
      docTopicsWeight = localModel.topicDistributions(documents).collect()
    case distModel: DistributedLDAModel =>
      docTopicsWeight = distModel.toLocal.topicDistributions(documents).collect()
  }
  docTopicsWeight
}
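For context, a minimal usage sketch of this function (assuming lda is the configured LDA instance and corpus the training RDD described above; with EM training the model is a DistributedLDAModel, whose distributions from training can be compared against the prediction):

// Hypothetical usage: train, predict on the same corpus, and compare both sets of distributions.
val ldaModel = lda.run(corpus)                 // corpus: RDD[(Long, Vector)]
val predicted = predict(corpus, ldaModel)      // document-topic weights from inference

// For an EM-trained (distributed) model, the distributions seen during training are also available:
ldaModel match {
  case dist: DistributedLDAModel => dist.topicDistributions.collect().foreach(println)
  case _ => ()
}
predicted.foreach(println)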
Answer 1:
I'm not sure whether your question is really about why you were getting errors in your code, but from what I understand, it looks like you were using the default Vector class instead of org.apache.spark.mllib.linalg.Vector. Secondly, you can't use a case-class match on the model directly; you'll need to use the isInstanceOf and asInstanceOf methods for that.
def predict(documents: RDD[(Long, org.apache.spark.mllib.linalg.Vector)], ldaModel: LDAModel): Array[(Long, org.apache.spark.mllib.linalg.Vector)] = {
  var docTopicsWeight = new Array[(Long, org.apache.spark.mllib.linalg.Vector)](documents.collect().length)
  if (ldaModel.isInstanceOf[LocalLDAModel]) {
    docTopicsWeight = ldaModel.asInstanceOf[LocalLDAModel].topicDistributions(documents).collect
  } else if (ldaModel.isInstanceOf[DistributedLDAModel]) {
    docTopicsWeight = ldaModel.asInstanceOf[DistributedLDAModel].toLocal.topicDistributions(documents).collect
  }
  docTopicsWeight
}
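A minimal usage sketch of the revised function (the names lda and documents are assumed: a configured LDA instance and the training corpus; an explicit import keeps MLlib's Vector from being shadowed by Scala's built-in Vector):

// Explicitly import MLlib's Vector so it is not confused with scala.collection.immutable.Vector.
import org.apache.spark.mllib.linalg.Vector

val model: LDAModel = lda.run(documents)                     // documents: RDD[(Long, Vector)]
val topicWeights: Array[(Long, Vector)] = predict(documents, model)
topicWeights.take(5).foreach(println)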
Answer 2:
Even though this seems to be an old post, I recently ran into the same issue, and I think I understand the problem you are reporting.
If you run a very basic test with two small documents of one word each and two topics, train with EM, and then get topicDistributions from the DistributedLDAModel, then with suitable alpha and beta the model infers that each document belongs to its own topic (document 1 to topic 1, document 2 to topic 2); in my case each document reached a probability of about 0.998 for its topic.
Running the same test, but this time converting the DistributedLDAModel to a LocalLDAModel, the probability of each document belonging to its topic degrades to about 0.666 (with the same alpha, beta, and number of topics).
What I did next was overload the .toLocal method to accept a new alpha and beta, and play with those values until the results got closer to the first test; but then I had more scenarios to cover, and every time I had to adjust the alpha parameter.
The conclusion in our team was that it doesn't seem right to predict with a DistributedLDAModel converted to a LocalLDAModel. https://github.com/rabarona/spark-shell-utils/tree/master/2.1.0/spark-mllib/DistributedLDAModel-to-LocalLDAModel
What was your conclusion? Did you find a solution?
P.S. This is just what I found running tests on small examples; if I'm missing something or saying something wrong, please let me know.
Code example:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.clustering._
import scala.collection.mutable
// Turn off warning messages:
Logger.getLogger("org").setLevel(Level.ERROR)
// Set number of topics
val numTopics: Int = 2
// Create corpus
val data: RDD[(String, String, Int)] = spark.sparkContext.parallelize(Seq(("cat fancy", "cat", 1),("dog world", "dog", 1)))
val corpus: RDD[Array[String]] = data.map({case (title: String, words: String, count: Int) => Array(words)})
val corpusSize: Long = corpus.count
val termCounts: Array[(String, Long)] = corpus.flatMap(_.map(_ -> 1L)).reduceByKey(_+_).collect.sortBy(-_._2)
val vocabArray: Array[String] = termCounts.takeRight(termCounts.size).map(_._1)
val vocab: Map[String, Int] = vocabArray.zipWithIndex.toMap
val documents: RDD[(Long, Vector)] =
  corpus.zipWithIndex.map { case (tokens, id) =>
    val counts = new mutable.HashMap[Int, Double]()
    tokens.foreach { term =>
      if (vocab.contains(term)) {
        val idx = vocab(term)
        counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
      }
    }
    (id, Vectors.sparse(vocab.size, counts.toSeq))
  }
/*
Corpus
(0,(2,[1],[1.0]))
(1,(2,[0],[1.0]))
*/
// Set up EM LDA Optimizer
val emOptimizer: EMLDAOptimizer = new EMLDAOptimizer
// Set up Online LDA Optimizer
val onlineOptimizer: OnlineLDAOptimizer = new OnlineLDAOptimizer()
  .setOptimizeDocConcentration(true)
  .setMiniBatchFraction({
    val corpusSize = corpus.count()
    if (corpusSize < 2) 0.75
    else (0.05 + 1) / corpusSize
  })
// Run LDA using EM LDA Optimizer and get as instance of Distributed LDA Model
val distributedModel: DistributedLDAModel = new LDA().setK(numTopics).setMaxIterations(20).setAlpha(1.002).setBeta(1.001).setOptimizer(emOptimizer).run(documents).asInstanceOf[DistributedLDAModel]
distributedModel.topicsMatrix.toString(2, Int.MaxValue).split("\n")(0)
distributedModel.topicsMatrix.toString(2, Int.MaxValue).split("\n")(1)
println("***** Distributed LDA Model topic distributions *****")
distributedModel.topicDistributions.collect.foreach(println)
// Run LDA using Online LDA Optimizer and get as instance of Local LDA Model
val localModel: LocalLDAModel = new LDA().setK(numTopics).setMaxIterations(100).setAlpha(0.0009).setBeta(0.00001).setOptimizer(onlineOptimizer).run(documents).asInstanceOf[LocalLDAModel]
println("***** Local LDA Model topic distributions *****")
localModel.topicDistributions(documents).collect.foreach(println)
/*
documentid, topicDistributions
(0,[0.999997999996,2.0000040000157828E-6])
(1,[2.000004000015782E-6,0.999997999996])
*/
// Convert Distributed LDA Model to Local LDA Model
val convertedModel: LocalLDAModel = distributedModel.toLocal
println("***** Local LDA Model from Distributed LDA Model topic distributions *****")
println("Performance is affected due to document concentration still same as used for EM")
convertedModel.topicDistributions(documents).collect.foreach(println)
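One way to see why the converted model behaves differently is to inspect the document concentration (alpha) each model carries; the lines below are a small addition to the example above and only use the public docConcentration accessor:

// Compare the priors carried by each model (added for illustration).
// The converted LocalLDAModel keeps the docConcentration that was used for EM training.
println("EM (distributed) alpha: " + distributedModel.docConcentration)
println("Converted local alpha:  " + convertedModel.docConcentration)
println("Online local alpha:     " + localModel.docConcentration)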
Source: https://stackoverflow.com/questions/33517649/the-accuracy-of-lda-predict-for-new-documents-with-spark