Extract document-topic matrix from Pyspark LDA Model

南旧 2021-02-01 22:55

I have successfully trained an LDA model in spark, via the Python API:

from pyspark.mllib.clustering import LDA
model = LDA.train(corpus, k=10)

This works fine, but now I would like to extract the document-topic matrix from the trained model. How can I do that?

3 Answers
  •  温柔的废话
    2021-02-01 23:15

    After extensive research, this is definitely not possible via the Python API on the current version of Spark (1.5.1). But in Scala it's fairly straightforward (given an RDD `documents` on which to train):

    import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
    
    // first generate RDD of documents...
    
    val numTopics = 10
    val lda = new LDA().setK(numTopics).setMaxIterations(10)
    val ldaModel = lda.run(documents)
    
    // then cast to a DistributedLDAModel to access per-document data
    val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
    

    Then getting the document topic distributions is as simple as:

    distLDAModel.topicDistributions
    
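    To illustrate what you get back: `topicDistributions` yields (documentId, topicDistribution) pairs, where each distribution is a vector of topic weights summing to 1. Once collected and exported to the Python side, the pairs can be assembled into a dense document-topic matrix. This is a minimal pure-Python sketch of that assembly step (the sample `pairs` data is made up for illustration):

    ```python
    def to_doc_topic_matrix(pairs):
        """Assemble collected (docId, distribution) pairs into a dense
        matrix with one row per document, ordered by document id."""
        return [dist for _, dist in sorted(pairs)]

    # Hypothetical collected output for 2 documents and 3 topics:
    pairs = [(1, [0.1, 0.7, 0.2]), (0, [0.6, 0.3, 0.1])]
    matrix = to_doc_topic_matrix(pairs)
    # matrix[0] is document 0's distribution over the 3 topics,
    # and each row sums to 1.
    ```

    Note that document ids assigned by Spark are not guaranteed to be consecutive, so in practice you may want to keep the ids alongside the rows rather than relying on sort order alone.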
