Hierarchical Dirichlet Process Gensim topic number independent of corpus size

Asked by 余生分开走, 2021-02-04 07:20

I am using the Gensim HDP module on a set of documents.

>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(num_topics=-1, num_words=20)


        
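A minimal, self-contained version of this setup (the toy documents below are placeholders, not the asker's data) illustrates the behaviour in the title: gensim's HdpModel works with a truncated approximation, so the number of topics it reports is bounded by its truncation parameter T (150 by default) rather than by the size of the corpus.

>>> from gensim import corpora, models
>>> documents = [["human", "interface", "computer"],
...              ["survey", "user", "computer", "system", "response", "time"],
...              ["eps", "user", "interface", "system"],
...              ["graph", "trees", "minors", "survey"]]
>>> dictionaryB = corpora.Dictionary(documents)
>>> corpusB = [dictionaryB.doc2bow(doc) for doc in documents]
>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> # num_topics=-1 asks for every topic in the truncated representation
>>> topics = hdp.show_topics(num_topics=-1, num_words=20, formatted=False)
>>> len(topics)  # bounded by T (150 by default), not by the number of documents
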
7 Answers
  •  梦如初夏 · 2021-02-04 07:58

    @Aron's and @Roko Mijic's approaches neglect the fact that show_topics returns, by default, only the top 20 words of each topic. If you return all the words that compose a topic, all the approximated topic probabilities will be 1 (or 0.999999). I experimented with the following code, which is an adaptation of @Roko Mijic's:

    import pandas as pd

    def topic_prob_extractor(gensim_hdp, t=-1, w=25, isSorted=True):
        """Approximate each topic's weight by summing the probabilities of its top-w words."""
        shown_topics = gensim_hdp.show_topics(num_topics=t, num_words=w, formatted=False)
        topics_nos = [topic[0] for topic in shown_topics]
        # Sum the top-w word probabilities of each topic as a rough measure of its weight.
        weights = [sum(prob for _, prob in shown_topics[topic_no][1]) for topic_no in topics_nos]
        df = pd.DataFrame({'topic_id': topics_nos, 'weight': weights})
        if isSorted:
            return df.sort_values(by='weight', ascending=False)
        return df
    

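    For reference, a minimal usage sketch of the function above; hdp is assumed to be the fitted HdpModel from the question:

    weights_df = topic_prob_extractor(hdp, t=-1, w=25, isSorted=True)
    print(weights_df.head(10))         # topics with the largest summed top-25 word weights
    print(weights_df['weight'].sum())  # total probability mass captured by the top 25 words of every topic
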
    A better, though I'm not sure entirely valid, approach is the one mentioned here. You can get the topics' true weights (the alpha vector) of the HDP model as:

    alpha = hdpModel.hdp_to_lda()[0]
    

    Examining each topic's alpha value is more logical than tallying up the weights of its first 20 words to approximate how often the topic is used in the data.
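
    As a rough sketch of how that alpha vector might be inspected (the normalization and the 1% cutoff below are illustrative choices, not part of the original approach; hdpModel is the fitted model):

    import numpy as np

    # hdp_to_lda() returns (alpha, beta); alpha holds the per-topic weights
    # of the truncated HDP posterior.
    alpha, beta = hdpModel.hdp_to_lda()

    # Normalize so the weights can be read as approximate topic probabilities.
    topic_probs = np.asarray(alpha, dtype=float) / np.sum(alpha)

    # Count topics that carry non-negligible mass (the 0.01 cutoff is arbitrary).
    significant = int(np.sum(topic_probs > 0.01))
    print(f"{significant} topics carry more than 1% of the total weight")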
