Question
When I train my LDA model as follows:
import multiprocessing
from gensim import corpora
from gensim.models import LdaMulticore

dictionary = corpora.Dictionary(data)
corpus = [dictionary.doc2bow(doc) for doc in data]
num_cores = multiprocessing.cpu_count()
num_topics = 50
lda = LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary,
                   workers=num_cores, alpha=1e-5, eta=5e-1)
I want to get the full topic distribution over all num_topics for each and every document. That is, in this particular case, I want each document to have 50 topics contributing to its distribution, and I want to be able to access all 50 topics' contributions. This is the output LDA should produce if adhering strictly to the mathematics of LDA. However, gensim only outputs topics that exceed a certain threshold, as shown here. For example, if I try
lda[corpus[89]]
>>> [(2, 0.38951721864890398), (9, 0.15438596408262636), (37, 0.45607443684895665)]
which shows only the 3 topics that contribute most to document 89. I have tried the solution in the link above, but it does not work for me; I still get the same output:
theta, _ = lda.inference(corpus)
theta /= theta.sum(axis=1)[:, None]  # normalize each row into a probability distribution
produces the same output, i.e. only 2-3 topics per document.
My question is: how do I change this threshold so I can access the FULL topic distribution for each document? That is, how can I access the full topic distribution, no matter how insignificant a topic's contribution to a given document? The reason I want the full distribution is so that I can perform a KL-divergence similarity search between documents' distributions.
Thanks in advance
Answer 1:
It doesn't seem that anyone has replied yet, so I'll try to answer this as best I can given the gensim documentation.
It seems you need to set the minimum_probability parameter to 0.0 when training the model to get the desired results:
lda = LdaMulticore(corpus=corpus, num_topics=num_topics, id2word=dictionary,
                   workers=num_cores, alpha=1e-5, eta=5e-1,
                   minimum_probability=0.0)
lda[corpus[233]]
>>> [(0, 5.8821799358842424e-07),
(1, 5.8821799358842424e-07),
(2, 5.8821799358842424e-07),
(3, 5.8821799358842424e-07),
(4, 5.8821799358842424e-07),
(5, 5.8821799358842424e-07),
(6, 5.8821799358842424e-07),
(7, 5.8821799358842424e-07),
(8, 5.8821799358842424e-07),
(9, 5.8821799358842424e-07),
(10, 5.8821799358842424e-07),
(11, 5.8821799358842424e-07),
(12, 5.8821799358842424e-07),
(13, 5.8821799358842424e-07),
(14, 5.8821799358842424e-07),
(15, 5.8821799358842424e-07),
(16, 5.8821799358842424e-07),
(17, 5.8821799358842424e-07),
(18, 5.8821799358842424e-07),
(19, 5.8821799358842424e-07),
(20, 5.8821799358842424e-07),
(21, 5.8821799358842424e-07),
(22, 5.8821799358842424e-07),
(23, 5.8821799358842424e-07),
(24, 5.8821799358842424e-07),
(25, 5.8821799358842424e-07),
(26, 5.8821799358842424e-07),
(27, 0.99997117731831464),
(28, 5.8821799358842424e-07),
(29, 5.8821799358842424e-07),
(30, 5.8821799358842424e-07),
(31, 5.8821799358842424e-07),
(32, 5.8821799358842424e-07),
(33, 5.8821799358842424e-07),
(34, 5.8821799358842424e-07),
(35, 5.8821799358842424e-07),
(36, 5.8821799358842424e-07),
(37, 5.8821799358842424e-07),
(38, 5.8821799358842424e-07),
(39, 5.8821799358842424e-07),
(40, 5.8821799358842424e-07),
(41, 5.8821799358842424e-07),
(42, 5.8821799358842424e-07),
(43, 5.8821799358842424e-07),
(44, 5.8821799358842424e-07),
(45, 5.8821799358842424e-07),
(46, 5.8821799358842424e-07),
(47, 5.8821799358842424e-07),
(48, 5.8821799358842424e-07),
(49, 5.8821799358842424e-07)]
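Since every topic id now appears in the output, the (topic_id, probability) pairs can be turned into a dense vector directly. A minimal sketch (not part of the original answer; it reuses the lda, corpus and num_topics defined above) using gensim's matutils.sparse2full:

import numpy as np
from gensim import matutils

# Convert the (topic_id, probability) pairs for one document into a dense
# NumPy vector of length num_topics (topic ids become array indices).
doc_topics = lda[corpus[233]]
doc_vec = matutils.sparse2full(doc_topics, num_topics)  # shape: (50,)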
Answer 2:
In case it may help someone else:
After training your LDA model, if you want to get all topics of a document without a lower threshold filtering any of them out, set minimum_probability to 0 when calling the get_document_topics method:
ldaModel.get_document_topics(bagOfWordOfADocument, minimum_probability=0.0)
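To connect this to the KL-divergence goal from the question, here is a rough sketch (the names dists and kl_01 are illustrative, not part of the original answer; lda, corpus and num_topics are the objects defined in the question) that builds the full distribution for every document and compares two of them with scipy.stats.entropy, which computes KL(p || q) when given two distributions:

import numpy as np
from gensim import matutils
from scipy.stats import entropy

# Dense num_docs x num_topics matrix of full per-document topic distributions.
dists = np.array([
    matutils.sparse2full(lda.get_document_topics(bow, minimum_probability=0.0),
                         num_topics)
    for bow in corpus
])

# KL divergence between the topic distributions of documents 0 and 1.
kl_01 = entropy(dists[0], dists[1])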
Source: https://stackoverflow.com/questions/45310925/how-to-get-a-complete-topic-distribution-for-a-document-using-gensim-lda