Extract document-topic matrix from Pyspark LDA Model

南旧 2021-02-01 22:55

I have successfully trained an LDA model in Spark, via the Python API:

from pyspark.mllib.clustering import LDA
model = LDA.train(corpus, k=10)

This works, but now I want to extract the document-topic matrix (the topic distribution for each document) from the trained model. How can I do that?
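
For reference, a minimal sketch of the corpus shape that mllib's LDA.train expects, an RDD of [document id, term-count vector] pairs (assuming sc is an active SparkContext; the counts below are made-up toy values):

from pyspark.mllib.linalg import Vectors

# Each document is [doc_id, term-count vector]; the counts here are made up.
corpus = sc.parallelize([
    [0, Vectors.dense([1.0, 2.0, 6.0])],
    [1, Vectors.dense([1.0, 3.0, 0.0])],
])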

3 Answers
  •  独厮守ぢ
    2021-02-01 23:03

    The following extends the above response for PySpark and Spark 2.0.

    I hope you'll excuse me for posting this as a reply instead of as a comment, but I lack the rep at the moment.

    I am assuming that you have a trained LDA model made from a corpus like so:

    from pyspark.ml.clustering import LDA

    lda = LDA(k=NUM_TOPICS, optimizer="em")
    ldaModel = lda.fit(corpus)  # corpus is a DataFrame with a 'features' column of term-count vectors
    

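    If it helps, here is a minimal sketch of what that corpus DataFrame might look like (the ids and term counts below are hypothetical toy values, assuming spark is an active SparkSession):

    from pyspark.ml.linalg import Vectors

    corpus = spark.createDataFrame([
        [0, Vectors.dense([1.0, 2.0, 6.0])],  # doc 0: counts over a 3-word vocabulary
        [1, Vectors.dense([1.0, 3.0, 0.0])],  # doc 1
    ], ["id", "features"])
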
    To convert a document into a topic distribution, we create a DataFrame with a document ID column and a vector (sparse is better) of its word counts.

    from pyspark.ml.linalg import Vectors

    # index_of_word, count, and another are placeholders for real vocabulary indices and counts.
    documents = spark.createDataFrame([
        [1, Vectors.sparse(len(words_in_our_corpus), {index_of_word: count})],
        [2, Vectors.sparse(len(words_in_our_corpus), {index_of_word: count, another: 1.0})],
    ], schema=["id", "features"])
    transformed = ldaModel.transform(documents)
    dist = transformed.take(1)
    # dist[0]['topicDistribution'] is now a dense vector of the topic proportions.
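
    To extract the full document-topic matrix rather than a single row, one option (a minimal sketch, assuming the result fits in driver memory) is to collect the topicDistribution column into a NumPy array:

    import numpy as np

    rows = transformed.select("id", "topicDistribution").collect()
    # One row per document, one column per topic.
    doc_topic_matrix = np.array([r["topicDistribution"].toArray() for r in rows])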
    
