Question
I am wondering whether I should pass the TF-IDF corpus or the plain bag-of-words corpus when inferring topics for documents with LDA in gensim.
Here is an example:
from gensim import corpora, models
import numpy.random
numpy.random.seed(10)
# Each document is a bag of words: a list of (token_id, count) pairs.
doc0 = [(0, 1), (1, 1)]
doc1 = [(0, 1)]
doc2 = [(0, 1), (1, 1)]
doc3 = [(0, 3), (1, 1)]
corpus = [doc0, doc1, doc2, doc3]
# The corpus is already in bag-of-words form, so build the dictionary from it directly.
dictionary = corpora.Dictionary.from_corpus(corpus)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
# Write the transformed corpus to disk in Matrix Market format, then stream it back.
corpora.MmCorpus.serialize('x.corpus_tfidf', corpus_tfidf)
corpus_tfidf = corpora.MmCorpus('x.corpus_tfidf')
lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
# Which of these should I use?
corpus_lda = lda[corpus]  # this one
corpus_LDA = lda[corpus_tfidf]  # or this one?
corpora.MmCorpus.serialize('x.corpus_lda', corpus_lda)
for i, j in enumerate(corpus_lda):
    print(j, corpus[i])
Answer 1:
According to Gensim's mailing list (the last post in particular), the standard procedure is to use a bag-of-words corpus. You can use a TF-IDF corpus, but it is unclear what effect this would have.
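To make the difference between the two inputs concrete, here is a minimal sketch that applies a hand-rolled TF-IDF weighting to the toy corpus from the question. It is not gensim's implementation: the helper `tfidf_transform` is hypothetical, and the weighting it uses (count times log2(N / df), followed by L2 normalisation) is an assumption chosen as one common TF-IDF variant. gensim's `TfidfModel` may differ in its details.

```python
import math

# The toy corpus from the question: each document is a list of (term_id, count) pairs.
corpus = [
    [(0, 1), (1, 1)],
    [(0, 1)],
    [(0, 1), (1, 1)],
    [(0, 3), (1, 1)],
]


def tfidf_transform(corpus):
    """Hypothetical helper: weight = count * log2(N / df), then L2-normalise per document.

    Assumption: this is one common TF-IDF variant, not gensim's own code.
    """
    n_docs = len(corpus)
    # Document frequency: the number of documents each term occurs in.
    df = {}
    for doc in corpus:
        for term_id, _ in doc:
            df[term_id] = df.get(term_id, 0) + 1

    transformed = []
    for doc in corpus:
        weights = [(t, c * math.log2(n_docs / df[t])) for t, c in doc]
        # Terms that occur in every document get weight zero and are dropped.
        weights = [(t, w) for t, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        transformed.append([(t, w / norm) for t, w in weights])
    return transformed


corpus_tfidf = tfidf_transform(corpus)
for bow, weighted in zip(corpus, corpus_tfidf):
    print(bow, "->", weighted)
```

Under this weighting, term 0 occurs in all four documents, so its weight is zero and it disappears from every document; doc1, which contains only term 0, ends up empty. The count information (for example the 3 in doc3) is also lost after normalisation. So `lda[corpus]` and `lda[corpus_tfidf]` are given quite different inputs, which is one way to see why the effect of feeding TF-IDF weights to LDA is hard to predict.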
Source: https://stackoverflow.com/questions/27147690/should-i-use-tfidf-corpus-or-just-corpus-to-inference-documents-using-lda