I have successfully trained an LDA model in Spark, via the Python API:
from pyspark.mllib.clustering import LDA
model = LDA.train(corpus, k=10)
The following extends the above response for PySpark and Spark 2.0.
I hope you'll excuse me for posting this as a reply instead of as a comment, but I lack the rep at the moment.
I am assuming that you have a trained LDA model made from a corpus like so:
from pyspark.ml.clustering import LDA

lda = LDA(k=NUM_TOPICS, optimizer="em")
ldaModel = lda.fit(corpus)  # corpus is a DataFrame with a 'features' column
To convert a document into a topic distribution, we create a dataframe of the document ID and a vector (sparse is better) of word counts:
from pyspark.ml.linalg import Vectors

documents = spark.createDataFrame([
    [1, Vectors.sparse(len(words_in_our_corpus), {index_of_word: count})],
    [2, Vectors.sparse(len(words_in_our_corpus), {index_of_word: count, another_index: 1.0})],
], schema=["id", "features"])
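In case it helps, here is a sketch of how the `{index_of_word: count}` dictionaries can be built from tokenized documents. This is plain Python with made-up toy documents; in a real pipeline you would more likely let `pyspark.ml.feature.CountVectorizer` produce the sparse vectors for you:

```python
from collections import Counter

# Hypothetical toy corpus of pre-tokenized documents.
docs = [["spark", "lda", "topic"], ["lda", "model", "lda"]]

# Assign each distinct word a stable index (this is the role CountVectorizer
# plays in a real pipeline).
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}

def to_sparse_counts(doc):
    """Map a tokenized document to {word_index: count}, the dict form
    that Vectors.sparse accepts."""
    return {index[w]: float(c) for w, c in Counter(doc).items()}

print(to_sparse_counts(docs[1]))  # e.g. {0: 2.0, 1: 1.0} for ["lda", "model", "lda"]
```

The size argument to `Vectors.sparse` would then be `len(vocab)`, and each dictionary becomes one row's feature vector.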
transformed = ldaModel.transform(documents)
dist = transformed.take(1)
# dist[0]['topicDistribution'] is now a dense vector of our topics.
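A common follow-up is to pull the most probable topic out of that dense vector. The sketch below uses a plain list with made-up probabilities standing in for the `DenseVector` (a `DenseVector` supports `len()` and indexing the same way):

```python
# Hypothetical topic distribution for one document, as returned in
# dist[0]['topicDistribution'] (values invented for illustration).
topic_distribution = [0.05, 0.70, 0.10, 0.15]

# Index of the most probable topic for this document.
top_topic = max(range(len(topic_distribution)),
                key=lambda i: topic_distribution[i])
print(top_topic)  # 1
```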