Term weighting for original LDA in gensim

六眼飞鱼酱① submitted on 2019-12-13 05:02:12

Question


I am using the gensim library to apply LDA to a set of documents. Using gensim I can apply LDA to a corpus whatever the term weights are: binary, tf, tf-idf...

My question is, what is the term weighting that should be used for the original LDA? If I have understood correctly the weights should be term frequencies, but I am not sure.


Answer 1:


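Yes: the corpus should be represented as a "bag of words", that is, lists of raw term counts. LDA models each document as a collection of word occurrences, so integer counts are the natural input.

The correct format is that of the corpus defined in the first tutorial on the Gensim webpage (the tutorials are really useful).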

Namely, if you have a dictionary as defined in Radim's tutorial, and the following documents,

doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
docs = [doc1, doc2]

then your corpus (for use with LDA) should be an iterable object (such as a list) of lists of tuples of the form (dictKey, count), where dictKey is the dictionary key of a term and count is the number of times that term occurs in the document. This is done for you with

corpus = [dictionary.doc2bow(doc) for doc in docs]

The name doc2bow stands for "document to bag of words".
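
To make the (dictKey, count) structure concrete, here is a minimal pure-Python sketch of the same idea. It is only an illustration and not gensim's implementation: the helpers build_token2id and to_bow are hypothetical names, doc3 is an extra document added here to show a count above 1, and the integer ids gensim assigns may differ from the ones produced below.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to each distinct token, in order of first appearance.
    (Hypothetical stand-in for a gensim dictionary; gensim's ids may differ.)"""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return a sorted list of (token_id, count) tuples for one document.
    Tokens that are not in the mapping are ignored."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

doc1 = ['big', 'data', 'technique', 'lots', 'of', 'cash']
doc2 = ['this', 'document', 'has', 'words']
doc3 = ['big', 'data', 'big', 'cash']  # assumed extra document with a repeated term
docs = [doc1, doc2, doc3]

token2id = build_token2id(docs)
corpus = [to_bow(doc, token2id) for doc in docs]

print(corpus[2])  # [(0, 2), (1, 1), (5, 1)]: 'big' occurs twice
```

With gensim itself, the dictionary would come from corpora.Dictionary(docs) and the corpus from doc2bow as shown above. That corpus of raw counts is what you would then pass to models.LdaModel together with id2word=dictionary and a num_topics value of your choosing.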



Source: https://stackoverflow.com/questions/25915441/term-weighting-for-original-lda-in-gensim
