Extract document-topic matrix from Pyspark LDA Model

南旧 2021-02-01 22:55

I have successfully trained an LDA model in Spark, via the Python API:

from pyspark.mllib.clustering import LDA
model = LDA.train(corpus, k=10)

This works, but now I want to extract the document-topic matrix (the topic distribution for each document) from the trained model. How can I do that?
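
For reference, a minimal sketch of the corpus shape that mllib's LDA.train expects, an RDD of [document id, term-count vector] pairs (assuming sc is an active SparkContext; the counts below are made-up toy values):

from pyspark.mllib.linalg import Vectors

# Each document is [doc_id, term-count vector]; the counts here are made up.
corpus = sc.parallelize([
    [0, Vectors.dense([1.0, 2.0, 6.0])],
    [1, Vectors.dense([1.0, 3.0, 0.0])],
])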

3 Answers
  •  独厮守ぢ
    2021-02-01 23:03

    The following extends the above response for PySpark and Spark 2.0.

    I hope you'll excuse me for posting this as a reply instead of as a comment, but I lack the rep at the moment.

    I am assuming that you have a trained LDA model made from a corpus like so:

    from pyspark.ml.clustering import LDA

    lda = LDA(k=NUM_TOPICS, optimizer="em")
    ldaModel = lda.fit(corpus)  # corpus is a DataFrame with a 'features' column of term-count vectors
    

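    If it helps, here is a minimal sketch of what that corpus DataFrame might look like (the ids and term counts below are hypothetical toy values, assuming spark is an active SparkSession):

    from pyspark.ml.linalg import Vectors

    corpus = spark.createDataFrame([
        [0, Vectors.dense([1.0, 2.0, 6.0])],  # doc 0: counts over a 3-word vocabulary
        [1, Vectors.dense([1.0, 3.0, 0.0])],  # doc 1
    ], ["id", "features"])
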
    To convert a document into a topic distribution, we create a DataFrame with a document ID column and a vector (sparse is better) of its word counts.

    from pyspark.ml.linalg import Vectors

    # index_of_word, count, and another are placeholders for real vocabulary indices and counts.
    documents = spark.createDataFrame([
        [1, Vectors.sparse(len(words_in_our_corpus), {index_of_word: count})],
        [2, Vectors.sparse(len(words_in_our_corpus), {index_of_word: count, another: 1.0})],
    ], schema=["id", "features"])
    transformed = ldaModel.transform(documents)
    dist = transformed.take(1)
    # dist[0]['topicDistribution'] is now a dense vector of the topic proportions.
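
    To extract the full document-topic matrix rather than a single row, one option (a minimal sketch, assuming the result fits in driver memory) is to collect the topicDistribution column into a NumPy array:

    import numpy as np

    rows = transformed.select("id", "topicDistribution").collect()
    # One row per document, one column per topic.
    doc_topic_matrix = np.array([r["topicDistribution"].toArray() for r in rows])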
    
