I have successfully trained an LDA model in Spark via the Python API:
from pyspark.mllib.clustering import LDA
model = LDA.train(corpus, k=10)
How do I now get the topic distribution for each document in the corpus?
After extensive research, this is definitely not possible via the Python API in the current version of Spark (1.5.1). But in Scala it's fairly straightforward, given an RDD documents on which to train:
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
// first generate RDD of documents...
val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)
val ldaModel = lda.run(documents)
// then cast to the distributed LDA model to access per-document results
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
Then getting the document-topic distributions is as simple as:
distLDAModel.topicDistributions
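For inspecting the result: topicDistributions returns an RDD[(Long, Vector)] pairing each document ID with its topic mixture, so a quick sketch for printing a few rows (continuing from the distLDAModel above) might look like:

// topicDistributions: RDD[(Long, Vector)] of (document ID, topic mixture)
distLDAModel.topicDistributions.take(3).foreach { case (docId, topics) =>
  // each Vector has numTopics entries that sum to roughly 1.0
  println(s"doc $docId: ${topics.toArray.map(p => f"$p%.3f").mkString(" ")}")
}

Note that take(3) pulls only a small sample to the driver; collecting the full RDD on a large corpus could exhaust driver memory.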