Hierarchical Dirichlet Process Gensim topic number independent of corpus size

余生分开走 2021-02-04 07:20

I am using the Gensim HDP module on a set of documents.

>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(         


        
7 answers
  •  无人及你
    2021-02-04 07:47

    One way to order (and optionally truncate) HDP topics is to average each topic's coherence over the individual texts in the corpus. Here "coherence" means the weight the model assigns to a topic in a given text. The following function does just that:

    def order_subset_by_coherence(dirichlet_model, bow_corpus, num_topics=10, num_keywords=10):
        """
        Orders topics based on their average coherence across the corpus
    
        Parameters
        ----------
            dirichlet_model : gensim.models.hdpmodel.HdpModel
            bow_corpus : list of lists (contains (id, freq) tuples)
            num_topics : int (default=10)
            num_keywords : int (default=10)
    
        Returns
        -------
            ordered_topics: list of lists containing topic tokens
        """
        # Keywords for every topic; 150 matches HdpModel's default top-level truncation (T=150)
        shown_topics = dirichlet_model.show_topics(num_topics=150, # return all topics
                                                   num_words=num_keywords,
                                                   formatted=False)
        model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
        # Per-text topic weights as lists of (topic_id, weight) tuples
        topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # cutoff probability to 0 
    
        # Flatten the per-text lists into one list of (topic_id, weight) tuples
        topics_per_response = [response for response in topic_corpus]
        flat_topic_coherences = [item for sublist in topics_per_response for item in sublist]
    
        # Average each topic's weight over the whole corpus (texts without the topic contribute 0)
        significant_topics = list(set([t_c[0] for t_c in flat_topic_coherences])) # those that appear
        topic_averages = [sum([t_c[1] for t_c in flat_topic_coherences if t_c[0] == topic_num]) / len(bow_corpus) \
                          for topic_num in significant_topics]
    
        # Sort topic ids by descending average, then look up their keywords
        topic_indexes_by_avg_coherence = [tup[0] for tup in sorted(enumerate(topic_averages), key=lambda i:i[1])[::-1]]
        significant_topics_by_avg_coherence = [significant_topics[i] for i in topic_indexes_by_avg_coherence]
        ordered_topics = [model_topics[i] for i in significant_topics_by_avg_coherence][:num_topics] # truncate if desired
    
        return ordered_topics
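
To make the averaging step easier to follow without training a model, here is a minimal sketch on hand-written data. It assumes per-text output in the same (topic_id, weight) form the function above works with; the names `average_topic_weights`, `order_topics` and `toy_doc_topics` are hypothetical and not part of Gensim.

```python
from collections import defaultdict


def average_topic_weights(doc_topics):
    """Average each topic's weight over all texts (a missing topic counts as 0)."""
    totals = defaultdict(float)
    for topics in doc_topics:
        for topic_id, weight in topics:
            totals[topic_id] += weight
    num_docs = len(doc_topics)
    return {topic_id: total / num_docs for topic_id, total in totals.items()}


def order_topics(doc_topics):
    """Return topic ids sorted by descending average weight."""
    averages = average_topic_weights(doc_topics)
    return sorted(averages, key=averages.get, reverse=True)


# Hypothetical per-text topic weights for three texts
toy_doc_topics = [
    [(0, 0.7), (2, 0.3)],
    [(2, 0.9), (5, 0.1)],
    [(0, 0.2), (2, 0.8)],
]

toy_averages = average_topic_weights(toy_doc_topics)
toy_order = order_topics(toy_doc_topics)
print(toy_order)  # -> [2, 0, 5]
```

Topic 2 comes first because it carries weight in all three texts, while topic 5 appears once with a small weight and so ends up with the lowest average.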
    

    A version of this function that also outputs the average coherences associated with the topics, for keyword (tag) generation for a corpus, can be found in this answer. A similar process for generating keywords for individual texts can be found in this answer.
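
The linked answers are not reproduced here. As a rough illustration only, pairing each topic's keywords with its average might look like the sketch below; the helper `keywords_with_averages` and the toy dictionaries are hypothetical and are not taken from those answers or from Gensim.

```python
def keywords_with_averages(topic_keywords, topic_averages, num_topics=2):
    """Pair each topic's keyword list with its average weight, highest average first."""
    ranked = sorted(topic_averages.items(), key=lambda pair: pair[1], reverse=True)
    return [(topic_keywords[topic_id], avg) for topic_id, avg in ranked[:num_topics]]


# Hypothetical keywords and corpus-level averages for three topics
toy_keywords = {0: ["model", "topic"], 2: ["corpus", "document"], 5: ["size", "number"]}
toy_topic_averages = {0: 0.3, 2: 0.65, 5: 0.05}

toy_tags = keywords_with_averages(toy_keywords, toy_topic_averages)
print(toy_tags)  # -> [(['corpus', 'document'], 0.65), (['model', 'topic'], 0.3)]
```

Keeping the averages next to the keywords lets a caller apply its own cutoff, for example dropping topics whose average falls below a chosen threshold.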
