Question
I extracted 145,185,965 sentences (14GB) from the English Wikipedia dump and want to train a Doc2Vec model on these sentences. Unfortunately I have 'only' 32GB of RAM and get a MemoryError when trying to train. Even if I set min_count to 50, gensim tells me it would need over 150GB of RAM. I don't think further increasing min_count is a good idea, because the resulting model would probably not be very good (just a guess). But anyway, I will try it with 500 to see whether memory is sufficient then.
Are there any possibilities to train such a large model with limited RAM?
Here is my current code:
from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

corpus = TaggedLineDocument(preprocessed_text_file)
model = Doc2Vec(vector_size=300,
                window=15,
                min_count=50,  #1
                workers=16,
                dm=0,
                alpha=0.75,
                min_alpha=0.001,
                sample=0.00001,
                negative=5)
model.build_vocab(corpus)
model.train(corpus,
            epochs=400,
            total_examples=model.corpus_count,
            start_alpha=0.025,
            end_alpha=0.0001)
Am I making some obvious mistakes? Am I using it completely wrong?
I could also try reducing the vector size, but I think this would give much worse results, since most papers use 300-dimensional vectors.
Answer 1:
The required model size in addressable memory is largely a function of the number of weights required, which is determined by the number of unique words and unique doc-tags.
With 145,000,000 unique doc-tags, no matter how many words you limit yourself to, just the raw doc-vectors in-training alone will require:
145,000,000 * 300 dimensions * 4 bytes/dimension = 174GB
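As a quick sanity check, you can reproduce that estimate with a few lines of Python (the numbers mirror the question's corpus; vocabulary and output-layer weights come on top of this):

n_doc_tags = 145_000_000      # roughly one tag per extracted sentence
vector_size = 300
bytes_per_float = 4           # float32
doc_vector_bytes = n_doc_tags * vector_size * bytes_per_float
print(doc_vector_bytes / 1e9)  # ~174 GB for the raw doc-vectors alone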
You could try a smaller data set. You could reduce the vector size. You could get more memory.
I would try one or more of those first, just to verify you're able to get things working and to see some initial results.
There is one trick, best considered experimental, that may allow training larger sets of doc-vectors, at some cost of extra complexity and lower performance: the docvecs_mapfile parameter of Doc2Vec.
Normally, you don't want a Word2Vec/Doc2Vec-style training session to use any virtual memory, because any recourse to slower disk IO makes training extremely slow. However, for a large doc-set which is only ever iterated over in one order, the performance hit may be survivable after making the doc-vectors array be backed by a memory-mapped file. Essentially, each training pass sweeps through the file from front to back, reading each section in once and paging it out once.
If you supply a docvecs_mapfile argument, Doc2Vec will allocate the doc-vectors array to be backed by that on-disk file. So you'll have a hundreds-of-GB file on disk (ideally SSD) whose ranges are paged in and out of RAM as necessary.
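A minimal sketch of how that might look, assuming a gensim version whose Doc2Vec constructor still accepts docvecs_mapfile (as described above); the mapfile path is purely illustrative, and the training call just uses the model's default epochs:

from gensim.models.doc2vec import Doc2Vec, TaggedLineDocument

corpus = TaggedLineDocument(preprocessed_text_file)
model = Doc2Vec(vector_size=300,
                window=15,
                min_count=50,
                workers=16,
                dm=0,
                sample=0.00001,
                negative=5,
                docvecs_mapfile='/fast_ssd/wiki_docvecs.mmap')  # hypothetical path on an SSD
model.build_vocab(corpus)
model.train(corpus,
            total_examples=model.corpus_count,
            epochs=model.epochs)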
If you try this, be sure to experiment with this option on small runs first, to familiarize yourself with its operation, especially around saving/loading models.
Note also that if you ever then do a default most_similar() on doc-vectors, another 174GB array of unit-normalized vectors must be created from the raw array. (You can force that to be done in place, clobbering the existing raw values, by explicitly calling init_sims(replace=True) before any other method that requires the unit-normed vectors.)
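For example (a sketch against the older gensim API the answer describes; method locations may differ in newer releases):

# Overwrite the raw doc-vectors with unit-normed ones in place, so that
# most_similar() does not have to allocate a second ~174GB array.
model.docvecs.init_sims(replace=True)
similar_docs = model.docvecs.most_similar(positive=[0], topn=10)  # e.g. neighbors of doc-tag 0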
Source: https://stackoverflow.com/questions/50390455/gensim-doc2vec-memoryerror-when-training-on-english-wikipedia