Should I use a TF-IDF corpus or just a plain corpus to infer documents using LDA?

Submitted by 删除回忆录丶 on 2019-12-10 14:33:22

Question


I am wondering whether a TF-IDF corpus or just the plain bag-of-words corpus should be used when inferring topics for documents with LDA in gensim.

Here is an example:

from gensim import corpora, models
import numpy.random
numpy.random.seed(10)

# Each document is a bag-of-words vector: a list of (token_id, count) pairs
doc0 = [(0, 1), (1, 1)]
doc1 = [(0, 1)]
doc2 = [(0, 1), (1, 1)]
doc3 = [(0, 3), (1, 1)]

corpus = [doc0, doc1, doc2, doc3]
# The documents are already vectors, so build the id-to-token mapping from them
dictionary = corpora.Dictionary.from_corpus(corpus)

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
# Write the transformed corpus to disk in Matrix Market format
corpora.MmCorpus.serialize('x.corpus_tfidf', corpus_tfidf)

corpus_tfidf = corpora.MmCorpus('x.corpus_tfidf')

lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)

# Which one should I use?
corpus_lda = lda[corpus]          # this one
corpus_LDA = lda[corpus_tfidf]    # or this one?

corpora.MmCorpus.serialize('x.corpus_lda', corpus_lda)

# Print the inferred topic distribution next to each original document
for i, j in enumerate(corpus_lda):
    print(j, corpus[i])

Answer 1:


According to a thread on the Gensim mailing list (the last post in particular), the standard procedure is to use a bag-of-words corpus. You can use a TF-IDF corpus, but it is unclear what effect this would have.
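One way to see why the effect is hard to predict: LDA's generative model is defined over word counts, while TF-IDF replaces those counts with fractional weights and can remove terms altogether. The sketch below applies a plain-Python TF-IDF to the toy corpus from the question. The helper `tfidf_transform` is hypothetical, and its weighting (log base 2 IDF, L2 normalization, near-zero weights dropped) is an assumption about what gensim's `TfidfModel` does by default, not the library's own code.

```python
import math

# Toy bag-of-words corpus from the question: (token_id, count) pairs.
corpus = [
    [(0, 1), (1, 1)],
    [(0, 1)],
    [(0, 1), (1, 1)],
    [(0, 3), (1, 1)],
]


def tfidf_transform(corpus, eps=1e-12):
    """Weight each count by log2(N / df), then L2-normalize each document.

    Hypothetical helper for illustration only. The weighting scheme is an
    assumption about gensim's TfidfModel defaults, not the library's code.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents containing each token id.
    doc_freq = {}
    for doc in corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    transformed = []
    for doc in corpus:
        weights = [(t, c * math.log2(n_docs / doc_freq[t])) for t, c in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        if norm > 0:
            weights = [(t, w / norm) for t, w in weights]
        # Drop entries whose weight is effectively zero.
        transformed.append([(t, w) for t, w in weights if abs(w) > eps])
    return transformed


corpus_tfidf = tfidf_transform(corpus)
for bow, weighted in zip(corpus, corpus_tfidf):
    print(bow, "->", weighted)
```

Token 0 occurs in all four documents, so its IDF is zero and it disappears from every document. doc1 ends up empty, and the other three documents reduce to token 1 with a weight of approximately 1.0, so the difference between doc0 and doc3 in the original counts is lost. Under these assumptions, `lda[corpus]` and `lda[corpus_tfidf]` are given quite different inputs for the same documents.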



Source: https://stackoverflow.com/questions/27147690/should-i-use-tfidf-corpus-or-just-corpus-to-inference-documents-using-lda
