Question
Today I started writing a script that trains LDA models on large corpora (at least 30M sentences) using the gensim library. Here is the code I am currently using:
import logging
import gensim
from gensim import corpora, models, similarities, matutils

def train_model(fname):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    # the dictionary is built by streaming over the file one line at a time
    dictionary = corpora.Dictionary(line.lower().split() for line in open(fname))
    print "DOC2BOW"
    # this list comprehension materializes the entire bag-of-words corpus in RAM
    corpus = [dictionary.doc2bow(line.lower().split()) for line in open(fname)]
    print "running LDA"
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100,
                                          update_every=1, chunksize=10000, passes=1)
Running this script on a small corpus (2M sentences), I realized that it needs about 7 GB of RAM, and when I try to run it on the larger corpora it fails because it runs out of memory. The problem is obviously that I am loading the whole corpus into memory with this command:
corpus = [dictionary.doc2bow(line.lower().split()) for line in open(fname)]
But I think there is no other way around it, because I need the corpus to call the LdaModel() method:
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100, update_every=1, chunksize=10000, passes=1)
I searched for a solution to this problem but could not find anything helpful. I would imagine it must be a common problem, since these models are mostly trained on very large corpora (usually Wikipedia dumps), so there should already be a solution for it.
Any ideas about this issue and how to solve it?
Answer 1:
Consider wrapping your corpus up as an iterable and passing that instead of a list (a plain generator will not work, since it can only be iterated over once).
From the tutorial:
class MyCorpus(object):
    def __iter__(self):
        for line in open(fname):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

corpus = MyCorpus()
lda = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                      id2word=dictionary,
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)
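Not part of the original answer, but a related pattern worth mentioning: because the streamed corpus above re-reads and re-tokenizes the text file on every pass, you could serialize the bag-of-words corpus to disk once in Matrix Market format and then stream it back from there. A minimal sketch, assuming the MyCorpus class and dictionary defined above and a placeholder file name 'corpus.mm':

corpora.MmCorpus.serialize('corpus.mm', MyCorpus())  # one streaming pass, written to disk
mm_corpus = corpora.MmCorpus('corpus.mm')            # documents are streamed back from disk
lda = gensim.models.ldamodel.LdaModel(corpus=mm_corpus,
                                      id2word=dictionary,
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)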
Additionally, gensim has several different corpus formats readily available, which can be found in the API reference. You might consider using TextCorpus, which should already fit your format nicely:
corpus = gensim.corpora.TextCorpus(fname)
lda = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                      id2word=corpus.dictionary,  # TextCorpus can build the dictionary for you
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)
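Whichever corpus variant you use, here is a short usage example for afterwards (the file name is just an illustrative placeholder, not from the original post): the trained model can be inspected and persisted with gensim's standard LdaModel methods:

lda.print_topics(10)        # show the 10 most significant topics
lda.save('lda_30m.model')   # persist the trained model to disk
# later: lda = gensim.models.ldamodel.LdaModel.load('lda_30m.model')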
Source: https://stackoverflow.com/questions/35609171/memory-efficient-lda-training-using-gensim-library