Question
Sometimes it returns probabilities for all topics and all is fine, but sometimes it returns probabilities for just a few topics and they don't add up to one, it seems it depends on the document. Generally when it returns few topics, the probabilities add up to more or less 80%, so is it returning just the most relevant topics? Is there a way to force it to return all probabilities?
Maybe I'm missing something but I can't find any documentation of the method's parameters.
Answer 1:
I had the same problem and solved it by passing the argument minimum_probability=0 when calling the get_document_topics method of gensim.models.ldamodel.LdaModel objects:
topic_assignments = lda.get_document_topics(corpus, minimum_probability=0)
By default, gensim doesn't output probabilities below 0.01, so for any document in particular, if there are any topics assigned probabilities under this threshold the sum of topic probabilities for that document will not add up to one.
Here's an example:
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=100)
# Try values of None (default) and 0 for the minimum_probability argument
for minimum_probability in (None, 0):
    # Get topic probabilities for each document
    topic_assignments = lda.get_document_topics(common_corpus, minimum_probability=minimum_probability)
    probabilities = [[entry[1] for entry in doc] for doc in topic_assignments]
    # Print output
    print(f"Calculating topic probabilities with minimum_probability argument = {str(minimum_probability)}")
    print("Sum of probabilities:")
    for i, P in enumerate(probabilities):
        sum_P = sum(P)
        print(f"\tdoc {i} = {sum_P}")
And the output would be:
Calculating topic probabilities with minimum_probability argument = None
Sum of probabilities:
doc 0 = 0.6733324527740479
doc 1 = 0.8585712909698486
doc 2 = 0.7549994885921478
doc 3 = 0.8019999265670776
doc 4 = 0.7524996995925903
doc 5 = 0
doc 6 = 0
doc 7 = 0
doc 8 = 0.5049992203712463
Calculating topic probabilities with minimum_probability argument = 0
Sum of probabilities:
doc 0 = 1.0000000400468707
doc 1 = 1.0000000337604433
doc 2 = 1.0000000079162419
doc 3 = 1.0000000284053385
doc 4 = 0.9999999937135726
doc 5 = 0.9999999776482582
doc 6 = 0.9999999776482582
doc 7 = 0.9999999776482582
doc 8 = 0.9999999930150807
This default behaviour is not clearly stated in the documentation. The default value of minimum_probability for the get_document_topics method is None, but this does not set the threshold to zero. Instead, minimum_probability falls back to the minimum_probability attribute of the gensim.models.ldamodel.LdaModel object, which defaults to 0.01, as you can see in the source code:
def __init__(self, corpus=None, num_topics=100, id2word=None,
distributed=False, chunksize=2000, passes=1, update_every=1,
alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10,
iterations=50, gamma_threshold=0.001, minimum_probability=0.01,
random_state=None, ns_conf=None, minimum_phi_value=0.01,
per_word_topics=False, callbacks=None, dtype=np.float32):
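To make the fallback-and-filter behaviour concrete, here is a minimal plain-Python sketch of the logic described above. This is a simplified re-implementation for illustration only, not gensim's actual code, and the function name filter_topic_dist is my own:

```python
def filter_topic_dist(topic_dist, minimum_probability=None, model_default=0.01):
    """Sketch of the thresholding described above (not gensim's real code).

    topic_dist: list of (topic_id, probability) pairs.
    """
    if minimum_probability is None:
        # None falls back to the model's own minimum_probability (0.01 by default)
        minimum_probability = model_default
    # Topics below the threshold are silently dropped
    return [(topic, prob) for topic, prob in topic_dist
            if prob >= minimum_probability]

dist = [(0, 0.85), (1, 0.12), (2, 0.02), (3, 0.008), (4, 0.002)]
print(filter_topic_dist(dist))                         # last two topics dropped, sum is 0.99
print(filter_topic_dist(dist, minimum_probability=0))  # all topics kept, sum is 1.0
```

This is why the truncated sums in the first run above fall short of one: the missing mass belongs to the topics that were filtered out.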
Answer 2:
I was working on LDA topic modeling and came across this post. I created two topics, say topic1 and topic2.
The top 10 words for each topic are as follows:
0.009*"would" + 0.008*"experi" + 0.008*"need" + 0.007*"like" + 0.007*"code" + 0.007*"work" + 0.006*"think" + 0.006*"make" + 0.006*"one" + 0.006*"get"
0.027*"ierr" + 0.018*"line" + 0.014*"0.0e+00" + 0.010*"error" + 0.009*"defin" + 0.009*"norm" + 0.006*"call" + 0.005*"type" + 0.005*"de" + 0.005*"warn"
Eventually, I took one document to determine the closest topic:
for d in doc:
    bow = dictionary.doc2bow(d.split())
    t = lda.get_document_topics(bow)
and the output is [(0, 0.88935698141006414), (1, 0.1106430185899358)].
To answer your first question: the probabilities do add up to 1.0 for a document, and that is what get_document_topics returns. The documentation clearly states that it returns the topic distribution for the given document bow, as a list of (topic_id, topic_probability) 2-tuples.
Further, I tried get_term_topics for the keyword "ierr":
t = lda.get_term_topics("ierr", minimum_probability=0.000001)
and the result is [(1, 0.027292299843400435)], which is nothing but the word's contribution to determining each topic, which makes sense.
So, you can label the document based on the topic distribution you get using get_document_topics and you can determine the importance of the word based on the contribution given by get_term_topics.
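Labeling a document this way amounts to picking the topic with the highest probability from get_document_topics. A minimal sketch using the example distribution above (the helper name label_document is my own, not part of gensim):

```python
def label_document(topic_dist):
    """Return the topic id with the highest probability.

    topic_dist: list of (topic_id, topic_probability) pairs,
    as returned by get_document_topics.
    """
    return max(topic_dist, key=lambda pair: pair[1])[0]

# The distribution from the example above
dist = [(0, 0.88935698141006414), (1, 0.1106430185899358)]
print(label_document(dist))  # prints 0: topic1 is the closest topic
```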
I hope this helps.
Source: https://stackoverflow.com/questions/44571617/probabilities-returned-by-gensims-get-document-topics-method-doesnt-add-up-to