Question
Sometimes it returns probabilities for all topics and all is fine, but sometimes it returns probabilities for just a few topics and they don't add up to one, it seems it depends on the document. Generally when it returns few topics, the probabilities add up to more or less 80%, so is it returning just the most relevant topics? Is there a way to force it to return all probabilities?
Maybe I'm missing something but I can't find any documentation of the method's parameters.
Answer 1:
I had the same problem and solved it by passing the argument minimum_probability=0 when calling the get_document_topics method of gensim.models.ldamodel.LdaModel objects:
topic_assignments = lda.get_document_topics(corpus, minimum_probability=0)
By default, gensim doesn't output probabilities below 0.01, so for any document in particular, if there are any topics assigned probabilities under this threshold the sum of topic probabilities for that document will not add up to one.
Here's an example:
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=100)
# Try values of None (default) and 0 for the minimum_probability argument
for minimum_probability in (None, 0):
    # Get topic probabilities for each document
    topic_assignments = lda.get_document_topics(common_corpus, minimum_probability=minimum_probability)
    probabilities = [[entry[1] for entry in doc] for doc in topic_assignments]
    # Print output
    print(f"Calculating topic probabilities with minimum_probability argument = {str(minimum_probability)}")
    print("Sum of probabilities:")
    for i, P in enumerate(probabilities):
        sum_P = sum(P)
        print(f"\tdoc {i} = {sum_P}")
And the output would be:
Calculating topic probabilities with minimum_probability argument = None
Sum of probabilities:
doc 0 = 0.6733324527740479
doc 1 = 0.8585712909698486
doc 2 = 0.7549994885921478
doc 3 = 0.8019999265670776
doc 4 = 0.7524996995925903
doc 5 = 0
doc 6 = 0
doc 7 = 0
doc 8 = 0.5049992203712463
Calculating topic probabilities with minimum_probability argument = 0
Sum of probabilities:
doc 0 = 1.0000000400468707
doc 1 = 1.0000000337604433
doc 2 = 1.0000000079162419
doc 3 = 1.0000000284053385
doc 4 = 0.9999999937135726
doc 5 = 0.9999999776482582
doc 6 = 0.9999999776482582
doc 7 = 0.9999999776482582
doc 8 = 0.9999999930150807
This default behaviour is not clearly stated in the documentation. The default value of minimum_probability for the get_document_topics method is None, but this does not set the threshold to zero. Instead, minimum_probability falls back to the minimum_probability attribute of the gensim.models.ldamodel.LdaModel object, which defaults to 0.01, as you can see in the source code:
def __init__(self, corpus=None, num_topics=100, id2word=None,
distributed=False, chunksize=2000, passes=1, update_every=1,
alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10,
iterations=50, gamma_threshold=0.001, minimum_probability=0.01,
random_state=None, ns_conf=None, minimum_phi_value=0.01,
per_word_topics=False, callbacks=None, dtype=np.float32):
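To make the fallback-and-filter behaviour concrete, here is a minimal plain-Python sketch of the logic described above. This is a simplified re-implementation for illustration only, not gensim's actual code, and the function name filter_topic_dist is my own:

```python
def filter_topic_dist(topic_dist, minimum_probability=None, model_default=0.01):
    """Sketch of the thresholding described above (not gensim's real code).

    topic_dist: list of (topic_id, probability) pairs.
    """
    if minimum_probability is None:
        # None falls back to the model's own minimum_probability (0.01 by default)
        minimum_probability = model_default
    # Topics below the threshold are silently dropped
    return [(topic, prob) for topic, prob in topic_dist
            if prob >= minimum_probability]

dist = [(0, 0.85), (1, 0.12), (2, 0.02), (3, 0.008), (4, 0.002)]
print(filter_topic_dist(dist))                         # last two topics dropped, sum is 0.99
print(filter_topic_dist(dist, minimum_probability=0))  # all topics kept, sum is 1.0
```

This is why the truncated sums in the first run above fall short of one: the missing mass belongs to the topics that were filtered out.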
Answer 2:
I was working on LDA topic modeling and came across this post. I created two topics, say topic1 and topic2.
The top 10 words for each topic are as follows:
0.009*"would" + 0.008*"experi" + 0.008*"need" + 0.007*"like" + 0.007*"code" + 0.007*"work" + 0.006*"think" + 0.006*"make" + 0.006*"one" + 0.006*"get"
0.027*"ierr" + 0.018*"line" + 0.014*"0.0e+00" + 0.010*"error" + 0.009*"defin" + 0.009*"norm" + 0.006*"call" + 0.005*"type" + 0.005*"de" + 0.005*"warn"
Eventually, I took one document to determine the closest topic:
for d in doc:
    bow = dictionary.doc2bow(d.split())
    t = lda.get_document_topics(bow)
and the output is [(0, 0.88935698141006414), (1, 0.1106430185899358)].
To answer your first question: the probabilities do add up to 1.0 for a document, and that is what get_document_topics returns. The documentation clearly states that it returns the topic distribution for the given document bow, as a list of (topic_id, topic_probability) 2-tuples.
Further, I tried get_term_topics for the keyword "ierr":
t = lda.get_term_topics("ierr", minimum_probability=0.000001)
and the result is [(1, 0.027292299843400435)], which is nothing but the word's contribution to determining each topic, which makes sense.
So, you can label the document based on the topic distribution you get using get_document_topics and you can determine the importance of the word based on the contribution given by get_term_topics.
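Labeling a document this way amounts to picking the topic with the highest probability from get_document_topics. A minimal sketch using the example distribution above (the helper name label_document is my own, not part of gensim):

```python
def label_document(topic_dist):
    """Return the topic id with the highest probability.

    topic_dist: list of (topic_id, topic_probability) pairs,
    as returned by get_document_topics.
    """
    return max(topic_dist, key=lambda pair: pair[1])[0]

# The distribution from the example above
dist = [(0, 0.88935698141006414), (1, 0.1106430185899358)]
print(label_document(dist))  # prints 0: topic1 is the closest topic
```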
I hope this helps.
Source: https://stackoverflow.com/questions/44571617/probabilities-returned-by-gensims-get-document-topics-method-doesnt-add-up-to