How to predict topics in unseen documents with an already trained LDA model in Spark 2.1.1?

轻奢々 2021-01-22 11:48

I am training an LDA model in PySpark (Spark 2.1.1) on a customer-review dataset. Based on that model, I now want to predict the topics in new, unseen text.

I am usin

1 Answer
  •  [愿得一人]
     2021-01-22 12:27

    You're going to need to pre-process the new data:

    # load the new data set to be passed through the pre-trained LDA
    data_new = pd.read_csv('YourNew.csv', encoding="ISO-8859-1")
    data_new = data_new.dropna()
    data_text_new = data_new[['Your Target Column']]
    data_text_new['index'] = data_text_new.index
    
    documents_new = data_text_new
    
    # run the new data through the same lemmatization and stopword
    # removal used at training time
    processed_docs_new = documents_new['Your Target Column'].map(preprocess)
    
    # build the bag-of-words corpus with the Dictionary built during
    # training (assumed here to be named `dictionary`) -- the word ids
    # must match the ones the model was trained on, so do not build a
    # new Dictionary for the unseen documents
    bow_corpus_new = [dictionary.doc2bow(doc) for doc in processed_docs_new]
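    The `preprocess` function above is whatever you used when training the model; it must be the exact same function, or the token streams won't line up. As a hypothetical stand-in (the real one likely uses gensim's tokenizer, a stemmer, and a proper stopword list), it could look something like this:

```python
import re

# Placeholder stopword list -- the real preprocess() should use the same
# list that was used at training time.
STOPWORDS = {"the", "a", "an", "is", "it", "and", "of", "to"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens of length >= 3, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]
```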

    Then you can pass it straight through the trained LDA; all it needs is that bow_corpus:

    ldamodel[bow_corpus_new]
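    To see why reusing the training dictionary matters: `doc2bow` maps each token to the integer id assigned when the dictionary was built, paired with its in-document count, and the LDA model only knows the ids it was trained on. A pure-Python illustration with a hypothetical toy id table:

```python
from collections import Counter

# Toy id table standing in for a gensim Dictionary's token2id mapping.
token2id = {"battery": 0, "screen": 1, "price": 2}

def doc2bow(tokens, token2id):
    """Mimic gensim's doc2bow: (token_id, count) pairs, unknown tokens dropped."""
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

doc2bow(["battery", "price", "battery", "unknown"], token2id)
# → [(0, 2), (2, 1)]
```

    A dictionary built from the new documents alone would assign different ids to the same words, silently corrupting the predictions.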

    If you want to write the results out to a CSV, try this:

    a = ldamodel[bow_corpus_new]
    b = data_text_new
    
    topic_0 = []
    topic_1 = []
    topic_2 = []
    
    # each entry in `a` is a list of (topic_id, probability) pairs;
    # convert it to a dict so topics missing from a document default
    # to 0.0 instead of shifting the positions of the remaining pairs
    for i in a:
        probs = dict(i)
        topic_0.append(probs.get(0, 0.0))
        topic_1.append(probs.get(1, 0.0))
        topic_2.append(probs.get(2, 0.0))
    
    d = {'Your Target Column': b['Your Target Column'].tolist(),
         'topic_0': topic_0,
         'topic_1': topic_1,
         'topic_2': topic_2}
    
    df = pd.DataFrame(data=d)
    df.to_csv("YourAllocated.csv", index=True, mode='a')
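    The dict conversion in the loop matters because gensim drops topics whose probability falls below the model's `minimum_probability` threshold, so the per-document lists can have fewer than three pairs. A small simulation of that output shape (the probabilities below are made up):

```python
# Simulated ldamodel[bow] output: (topic_id, probability) pairs per doc.
simulated = [
    [(0, 0.70), (1, 0.20), (2, 0.10)],  # all three topics present
    [(0, 0.85), (2, 0.15)],             # topic 1 fell below the threshold
]

# Keying by topic id makes the extraction robust to the missing pair.
rows = []
for doc_topics in simulated:
    probs = dict(doc_topics)
    rows.append([probs.get(t, 0.0) for t in range(3)])

print(rows)  # → [[0.7, 0.2, 0.1], [0.85, 0.0, 0.15]]
```

    Positional indexing like `i[1][1]` would instead read topic 2's probability into the topic-1 column for the second document.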

    I hope this helps :)
