What is the best way to obtain the optimal number of topics for an LDA model using Gensim?

北海茫月 2020-12-20 16:16

I am trying to obtain the optimal number of topics for an LDA model within Gensim. One method I found is to calculate the log likelihood for each model and compare each against the others.

2 Answers
  • 2020-12-20 16:26

    Although I cannot comment on Gensim in particular, I can weigh in with some general advice for optimising your topics.

    As you stated, using log likelihood is one method. Another option is to keep a set of documents held out from the model generation process, infer topics over them once the model is complete, and check whether the results make sense.
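
    A minimal sketch of that held-out check with Gensim might look like the following, assuming texts is a list of tokenized documents; the variable names and the 100-document split are placeholders, not part of the original answer:

    from gensim import corpora
    from gensim.models import LdaModel

    # Hold out e.g. the last 100 documents from model generation (illustrative split)
    train_texts, heldout_texts = texts[:-100], texts[-100:]

    dictionary = corpora.Dictionary(train_texts)
    train_bow = [dictionary.doc2bow(doc) for doc in train_texts]
    heldout_bow = [dictionary.doc2bow(doc) for doc in heldout_texts]

    for k in range(5, 30, 5):
        lda = LdaModel(corpus=train_bow, id2word=dictionary, num_topics=k,
                       passes=10, random_state=42)
        # Per-word likelihood bound on the held-out documents (higher, i.e. closer to zero, is better)
        print(k, lda.log_perplexity(heldout_bow))
        # Topic mixture inferred for one unseen document, to eyeball whether it makes sense
        print(lda.get_document_topics(heldout_bow[0]))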

    A completely different method you could try is a hierarchical Dirichlet process (HDP), which infers the number of topics in the corpus dynamically rather than requiring it to be specified in advance.
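
    A sketch of this using Gensim's HdpModel, reusing the hypothetical dictionary and train_bow from the snippet above:

    from gensim.models import HdpModel

    # HDP infers a (truncated) set of topics from the data instead of taking num_topics as input
    hdp = HdpModel(corpus=train_bow, id2word=dictionary, random_state=42)

    # Inspect the topics that received meaningful probability mass
    for topic in hdp.show_topics(num_topics=20, num_words=10, formatted=True):
        print(topic)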

    There are many papers on how best to specify parameters and evaluate your topic model; depending on your experience level, these may or may not be useful to you:

    Rethinking LDA: Why Priors Matter, Wallach, H.M., Mimno, D. and McCallum, A.

    Evaluation Methods for Topic Models, Wallach H.M., Murray, I., Salakhutdinov, R. and Mimno, D.

    Also, here is the paper about the hierarchical Dirichlet process:

    Hierarchical Dirichlet Processes, Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M.

  • 2020-12-20 16:52

    A general rule of thumb is to create LDA models across a range of topic numbers, and then check the Jaccard similarity and coherence for each. Coherence in this case measures a single topic by the degree of semantic similarity between its high-scoring words (do these words co-occur across the text corpus?). The following will give a strong intuition for the optimal number of topics. This should be a baseline before jumping to the hierarchical Dirichlet process, as that technique has been found to have issues in practical applications.

    Start by creating dictionaries of the models and their topic words for the various topic numbers you want to consider, where in this case corpus is the cleaned tokens, num_topics is a list of the topic numbers you want to consider, and num_keywords is the number of top words per topic to be considered for the metrics:

    import seaborn as sns
    import matplotlib.pyplot as plt
    import numpy as np
    from gensim.models import LdaModel, CoherenceModel
    from gensim import corpora
    
    dirichlet_dict = corpora.Dictionary(corpus)
    bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]
    
    # Consider 1-15 topics (the largest topic number is dropped later when comparing adjacent models)
    num_topics = list(range(1, 16))
    num_keywords = 15
    
    LDA_models = {}
    LDA_topics = {}
    for i in num_topics:
        LDA_models[i] = LdaModel(corpus=bow_corpus,
                                 id2word=dirichlet_dict,
                                 num_topics=i,
                                 update_every=1,
                                 chunksize=len(bow_corpus),
                                 passes=20,
                                 alpha='auto',
                                 random_state=42)
    
        shown_topics = LDA_models[i].show_topics(num_topics=i, 
                                                 num_words=num_keywords,
                                                 formatted=False)
        LDA_topics[i] = [[word[0] for word in topic[1]] for topic in shown_topics]
    

    Now create a function to derive the Jaccard similarity of two topics:

    def jaccard_similarity(topic_1, topic_2):
        """
        Derives the Jaccard similarity of two topics
    
        Jaccard similarity:
        - A statistic used for comparing the similarity and diversity of sample sets
        - J(A,B) = (A ∩ B)/(A ∪ B)
        - Goal is low Jaccard scores for coverage of the diverse elements
        """
        intersection = set(topic_1).intersection(set(topic_2))
        union = set(topic_1).union(set(topic_2))
                        
        return float(len(intersection))/float(len(union))
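
    For example, two topics that share two keywords out of four distinct ones score 0.5:

    jaccard_similarity(['cat', 'dog', 'fish'], ['dog', 'fish', 'bird'])  # -> 0.5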
    

    Use the above to derive the mean stability across topics by comparing each model's topics with those of the model that has one additional topic:

    LDA_stability = {}
    for i in range(0, len(num_topics)-1):
        jaccard_sims = []
        for topic1 in LDA_topics[num_topics[i]]:
            sims = []
            for topic2 in LDA_topics[num_topics[i+1]]:
                sims.append(jaccard_similarity(topic1, topic2))

            jaccard_sims.append(sims)

        LDA_stability[num_topics[i]] = jaccard_sims

    mean_stabilities = [np.array(LDA_stability[i]).mean() for i in num_topics[:-1]]
    

    Gensim has a built-in model for topic coherence (this uses the 'c_v' coherence measure):

    coherences = [CoherenceModel(model=LDA_models[i], texts=corpus, dictionary=dirichlet_dict, coherence='c_v').get_coherence()\
                  for i in num_topics[:-1]]
    

    From here, roughly derive the ideal number of topics as the topic number that maximizes the difference between coherence and stability:

    coh_sta_diffs = [coherences[i] - mean_stabilities[i] for i in range(len(coherences))] # one difference per candidate topic number
    coh_sta_max = max(coh_sta_diffs)
    coh_sta_max_idxs = [i for i, j in enumerate(coh_sta_diffs) if j == coh_sta_max]
    ideal_topic_num_index = coh_sta_max_idxs[0] # choose the lower topic number in case there's more than one max
    ideal_topic_num = num_topics[ideal_topic_num_index]
    

    Finally graph these metrics across the topic numbers:

    plt.figure(figsize=(20,10))
    ax = sns.lineplot(x=num_topics[:-1], y=mean_stabilities, label='Average Topic Overlap')
    ax = sns.lineplot(x=num_topics[:-1], y=coherences, label='Topic Coherence')
    
    ax.axvline(x=ideal_topic_num, label='Ideal Number of Topics', color='black')
    ax.axvspan(xmin=ideal_topic_num - 1, xmax=ideal_topic_num + 1, alpha=0.5, facecolor='grey')
    
    y_max = max(max(mean_stabilities), max(coherences)) + (0.10 * max(max(mean_stabilities), max(coherences)))
    ax.set_ylim([0, y_max])
    ax.set_xlim([1, num_topics[-1]-1])
                    
    ax.set_title('Model Metrics per Number of Topics', fontsize=25)
    ax.set_ylabel('Metric Level', fontsize=20)
    ax.set_xlabel('Number of Topics', fontsize=20)
    plt.legend(fontsize=20)
    plt.show()   
    

    Your ideal number of topics will maximize coherence and minimize the topic overlap based on Jaccard similarity. In this case it looks like we'd be safe choosing topic numbers around 14.
