Latent Dirichlet allocation (LDA) in Spark - replicate model

Question

I want to save an LDA model from the pyspark.ml.clustering package and, after saving, apply it to the training and test datasets. However, the results diverge even though I set a seed. My code is the following:

1) Import packages

from pyspark.ml.clustering import LDA, LocalLDAModel, DistributedLDAModel
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.sql.functions import monotonically_increasing_id

2) Preparing the dataset

countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())    
corpus = result_tfidf.select("id", "features")

3) Training the LDA model

lda = LDA(k=number_of_topics, maxIter=100, docConcentration=[alpha], topicConcentration=beta, seed=123)
model = lda.fit(corpus)
model.save("LDA_model_saved")
topics = model.describeTopics(words_in_topic)  
topics_rdd = topics.rdd
modelled_corpus = model.transform(corpus)

4) Replicate the model

#Prepare the data set
countVectors = CountVectorizer(inputCol="requester_instruction_words_filtered_complete", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cv_model = countVectors.fit(tokenized_stopwords_sample_df)
result_tf = cv_model.transform(tokenized_stopwords_sample_df)
vocabArray = cv_model.vocabulary
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_tf)
result_tfidf = idfModel.transform(result_tf)   
result_tfidf = result_tfidf.withColumn("id", monotonically_increasing_id())
corpus_new = result_tfidf.select("id", "features")

#Load the model to apply to new corpus
newModel = LocalLDAModel.load("LDA_model_saved")
topics_new = newModel.describeTopics(words_in_topic)  
topics_rdd_new = topics_new.rdd
modelled_corpus_new = newModel.transform(corpus_new)

The following results differ, although I expected them to be equal: topics_rdd != topics_rdd_new and modelled_corpus != modelled_corpus_new. When I inspect them, both the extracted topics and the predicted classes on the dataset are different.

So I find it really strange that the same model predicts different classes ("topics") on the same dataset, even though I set a seed when training the model. Can someone with experience in replicating LDA models help?

Thank you :)
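A note on the comparison itself: as far as I know, PySpark RDD and DataFrame objects do not define content-based equality, so an expression such as topics_rdd != topics_rdd_new compares the Python objects and not their rows. A minimal sketch of a content comparison on collected describeTopics() rows follows. The helper name topics_match and the tolerance are assumptions for illustration and do not come from the question.

```python
import math


def topics_match(rows_a, rows_b, rel_tol=1e-9):
    """Return True when two collected describeTopics() results agree.

    Each row is expected to be (topic, termIndices, termWeights), which is
    the column order describeTopics() returns. Spark Row objects are tuples,
    so they unpack the same way as the plain tuples used for testing.
    """
    if len(rows_a) != len(rows_b):
        return False
    for (topic_a, idx_a, w_a), (topic_b, idx_b, w_b) in zip(rows_a, rows_b):
        # Topic ids and the ranked term indices must be identical.
        if topic_a != topic_b or list(idx_a) != list(idx_b):
            return False
        # Weights are floats, so compare them with a tolerance.
        if len(w_a) != len(w_b):
            return False
        if not all(math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(w_a, w_b)):
            return False
    return True


# Usage with the objects from the question (not executed here):
# rows_a = model.describeTopics(words_in_topic).orderBy("topic").collect()
# rows_b = newModel.describeTopics(words_in_topic).orderBy("topic").collect()
# print(topics_match(rows_a, rows_b))
```

Sorting by the topic column before collecting keeps the row order stable, so a mismatch reported by the helper reflects the topics and not the order in which Spark returned the rows.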

Answer 1:

I faced a similar problem while implementing LDA in PySpark. Even though I set a seed, the results were different every time I re-ran the code on the same data with the same parameters.

After trying a multitude of things, I came up with the following solution:

1) I saved cv_model after fitting it once and loaded it in later runs rather than re-fitting it.
2) This one is more specific to my dataset. Some of the documents in my corpus were very short (around 3 words per document). I filtered these out and set a limit so that only documents with at least 15 words are included in the corpus (the limit may need to be higher for yours). I am not sure why this worked; it may be related to the underlying complexity of the model.

All in all, my results are now the same even after several runs. Hope this helps.

Source: https://stackoverflow.com/questions/54515934/latent-dirichlet-allocation-lda-in-spark-replicate-model

Tags

apache-spark

pyspark

lda