I have successfully trained an LDA model in Spark via the Python API:
from pyspark.mllib.clustering import LDA
model = LDA.train(corpus, k=10)
How do I now get the topic distribution for each document in the corpus?
After extensive research, this is definitely not possible via the Python API in the current version of Spark (1.5.1). But in Scala it's fairly straightforward, given an RDD documents on which to train:
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
// first generate RDD of documents...
val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)
val ldaModel = lda.run(documents)
// then cast to the distributed LDA model to access per-document results
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
Then getting the document-topic distributions is as simple as:
distLDAModel.topicDistributions
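For inspecting the result: topicDistributions returns an RDD[(Long, Vector)] pairing each document ID with its topic mixture, so a quick sketch for printing a few rows (continuing from the distLDAModel above) might look like:

// topicDistributions: RDD[(Long, Vector)] of (document ID, topic mixture)
distLDAModel.topicDistributions.take(3).foreach { case (docId, topics) =>
  // each Vector has numTopics entries that sum to roughly 1.0
  println(s"doc $docId: ${topics.toArray.map(p => f"$p%.3f").mkString(" ")}")
}

Note that take(3) pulls only a small sample to the driver; collecting the full RDD on a large corpus could exhaust driver memory.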