Can we use a self made corpus for training for LDA using gensim?

Posted by 醉酒当歌 on 2019-12-03 07:40:28

Question


I have to apply LDA (Latent Dirichlet Allocation) to extract the possible topics from a database of 20,000 documents that I collected.

How can I use these documents as the training corpus, rather than an existing corpus such as the Brown Corpus or English Wikipedia?

You can refer to this page.


Answer 1:


After going through the documentation of the Gensim package, I found that there are four formats in which a text repository can be serialized as a corpus:

  1. Matrix Market (.mm)
  2. SVMlight (.svmlight)
  3. Blei's LDA-C format (.lda-c)
  4. GibbsLDA++ "list of words" format (.low)

In this problem, the database mentioned above contains 19,188 documents in total. Each document has to be read and tokenized, with stopwords and punctuation removed, which can be done using nltk.
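The preprocessing step can be sketched roughly as follows. This is a minimal illustration that uses only the standard library so it runs without any downloads: the tiny `STOPWORDS` set, the `preprocess` helper and the two sample sentences are assumptions made for the example, not part of the original answer. In practice the stopword list from nltk (`nltk.corpus.stopwords.words('english')`) would replace the hand-written set.

```python
import string

# Tiny illustrative stopword set (an assumption for this sketch);
# nltk.corpus.stopwords.words('english') is the fuller alternative.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "it"}


def preprocess(document):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    table = str.maketrans("", "", string.punctuation)
    tokens = document.lower().translate(table).split()
    return [token for token in tokens if token not in STOPWORDS]


# Hypothetical raw documents standing in for the real database.
raw_documents = [
    "The cat is on the mat.",
    "Dogs and cats are pets!",
]

# final_text: one list of tokens per document.
final_text = [preprocess(doc) for doc in raw_documents]
print(final_text)
```

The result is a list of token lists, one per document, which is the shape that `corpora.Dictionary` and `doc2bow` expect.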

import gensim
from gensim import corpora, similarities, models

# Text preprocessing is done here using nltk.
# final_text holds the tokens of all the documents (one list of tokens per document).

# Build the dictionary and the bag-of-words corpus, then save both to disk.

dictionary = corpora.Dictionary(final_text)
dictionary.save('questions.dict')
corpus = [dictionary.doc2bow(text) for text in final_text]
corpora.MmCorpus.serialize('questions.mm', corpus)
corpora.SvmLightCorpus.serialize('questions.svmlight', corpus)
corpora.BleiCorpus.serialize('questions.lda-c', corpus)
corpora.LowCorpus.serialize('questions.low', corpus)

# The dictionary and corpus can then be used to train an LDA model.
# update_every=0 selects batch training; chunksize is set to the number of documents.

mm = corpora.MmCorpus('questions.mm')
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=dictionary, num_topics=100, update_every=0, chunksize=19188, passes=20)

In this way, one can transform a dataset into a corpus and use it to train an LDA topic model with the gensim package.



Source: https://stackoverflow.com/questions/16254207/can-we-use-a-self-made-corpus-for-training-for-lda-using-gensim
