I have successfully trained an LDA model in Spark via the Python API:
from pyspark.mllib.clustering import LDA
model = LDA.train(corpus, k=10)
This works fine, but now I want to run the trained model on documents to get their topic distributions. How can I do that?
As of Spark 2.0 you can use transform() as a method on pyspark.ml.clustering.DistributedLDAModel (note this is the DataFrame-based pyspark.ml API, not the pyspark.mllib API used in the question). I just tried this on the 20 newsgroups dataset from scikit-learn and it works. Look at the topicDistribution field in the returned rows, which is the distribution over topics for each document.
>>> test_results = ldaModel.transform(wordVecs)
>>> test_results.first()
Row(filename='/home/jovyan/work/data/20news_home/20news-bydate-test/rec.autos/103343', target=7, text='I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy.', tokens=['little', 'confused', 'models', 'bonnevilles', 'someone', 'differences', 'features', 'performance', 'curious', 'prefereably', 'usually', 'demand', 'spring', 'summer'], vectors=SparseVector(10977, {28: 1.0, 29: 1.0, 152: 1.0, 301: 1.0, 496: 1.0, 552: 1.0, 571: 1.0, 839: 1.0, 1114: 1.0, 1281: 1.0, 1288: 1.0, 1624: 1.0}), topicDistribution=DenseVector([0.0462, 0.0538, 0.045, 0.0473, 0.0545, 0.0487, 0.0529, 0.0535, 0.0467, 0.0549, 0.051, 0.0466, 0.045, 0.0487, 0.0482, 0.0509, 0.054, 0.0472, 0.0547, 0.0501]))
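For reference, here is a minimal end-to-end sketch of that workflow with the DataFrame-based API. The docs DataFrame, the column names, and k=10 are illustrative assumptions on my part, not taken from the post above:

from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

# Assumed input: a DataFrame `docs` with an array-of-strings column `tokens`.
# Turn the token arrays into sparse count vectors (the `vectors` column above).
cv = CountVectorizer(inputCol="tokens", outputCol="vectors")
cvModel = cv.fit(docs)
wordVecs = cvModel.transform(docs)

# Fit the DataFrame-based LDA (pyspark.ml, not pyspark.mllib).
lda = LDA(k=10, featuresCol="vectors")
ldaModel = lda.fit(wordVecs)

# transform() appends a `topicDistribution` column: one probability per topic
# for each document.
results = ldaModel.transform(wordVecs)
results.select("topicDistribution").show(truncate=False)

Because transform() only scores documents against the already-fitted topics, you can apply ldaModel to a held-out DataFrame as well, as long as it has a vectors column produced by the same cvModel.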