How to incrementally train a word2vec model with new vocabularies


Question


I have a dataset of over 40 GB. My tokenizer process gets killed because it runs out of memory, so I am trying to split the dataset. How can I train a word2vec model incrementally, that is, use separate datasets to train one word2vec model?

My current word2vec code is:

import gensim

# The constructor call already builds the vocabulary and runs one full
# training pass; the train() call below repeats training on the same corpus.
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=1, workers=10)
model.train(documents, total_examples=len(documents), epochs=epochs)
model.save("./word2vec150d/word2vec_{}.model".format(epochs))

Any help would be appreciated!


Answer 1:


I have found the solution: use PathLineSentences. It is very fast. Merely calling train() on an existing model with new data cannot learn new vocabularies, but PathLineSentences streams sentences from every file in a directory, so the whole dataset never has to fit in memory and the vocabulary is built over all of it.

import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

model = Word2Vec(PathLineSentences(input_dir), size=100, window=5, min_count=5, workers=multiprocessing.cpu_count() * 2, iter=20, sg=1)
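
Note that PathLineSentences reads every file in input_dir in alphabetical order and expects the same format as LineSentence: one sentence per line, with tokens already split and separated by whitespace. The tokenizer output can therefore be written to disk chunk by chunk instead of being held in memory all at once.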

For a single file, use LineSentence:

import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(LineSentence(file), size=100, window=5, min_count=5, workers=multiprocessing.cpu_count() * 2, iter=20, sg=1)
...
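
For completeness: gensim does also expose a way to grow an existing model's vocabulary between training rounds, build_vocab(..., update=True). Below is a minimal sketch, assuming the corpus has been split into hypothetical files chunk1.txt and chunk2.txt in LineSentence format; vectors learned this way are not guaranteed to match the quality of a single pass over the full corpus, which is why PathLineSentences is preferable here.

import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

chunks = ["chunk1.txt", "chunk2.txt"]  # hypothetical file names

# Build the initial vocabulary and train on the first chunk.
model = Word2Vec(LineSentence(chunks[0]), size=100, window=5, min_count=5,
                 workers=multiprocessing.cpu_count(), sg=1)

# For each later chunk, add its new words to the vocabulary, then train on it.
for path in chunks[1:]:
    sentences = LineSentence(path)
    model.build_vocab(sentences, update=True)  # expands the existing vocab
    model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)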


Source: https://stackoverflow.com/questions/58925659/how-to-incrementally-train-a-word2vec-model-with-new-vocabularies
