Spark 2.1.1: How to predict topics in unseen documents on already trained LDA model in Spark 2.1.1?

前端未结

关注

 1  407

轻奢々 2021-01-22 11:48

I am training an LDA model in pyspark (spark 2.1.1) on a customers review dataset. Now based on that model I want to predict the topics in the new unseen text.

I am usin

1条回答

[愿得一人] (楼主)

2021-01-22 12:27

You're going to need to pre-process the new data:

# import a new data set to be passed through the pre-trained LDA

data_new = pd.read_csv('YourNew.csv', encoding = "ISO-8859-1");
data_new = data_new.dropna()
data_text_new = data_new[['Your Target Column']]
data_text_new['index'] = data_text_new.index

documents_new = data_text_new
#documents_new = documents.dropna(subset=['Preprocessed Document'])

# process the new data set through the lemmatization, and stopwork functions
processed_docs_new = documents_new['Preprocessed Document'].map(preprocess)

# create a dictionary of individual words and filter the dictionary
dictionary_new = gensim.corpora.Dictionary(processed_docs_new[:])
dictionary_new.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

# define the bow_corpus
bow_corpus_new = [dictionary_new.doc2bow(doc) for doc in processed_docs_new]

Then you can just pass it through the trained LDA as a function. All you need is that bow_corpus:

ldamodel[bow_corpus_new[:len(bow_corpus_new)]]

If you want it out in a csv try this:

a = ldamodel[bow_corpus_new[:len(bow_corpus_new)]]
b = data_text_new

topic_0=[]
topic_1=[]
topic_2=[]

for i in a:
    topic_0.append(i[0][1])
    topic_1.append(i[1][1])
    topic_2.append(i[2][1])
    
d = {'Your Target Column': b['Your Target Column'].tolist(),
     'topic_0': topic_0,
     'topic_1': topic_1,
     'topic_2': topic_2}
     
df = pd.DataFrame(data=d)
df.to_csv("YourAllocated.csv", index=True, mode = 'a')

I hope this helps :)

0 讨论(0)