Question
I have trained word embeddings on 26 million tweets with the skip-gram technique, as follows:
import gensim

# Stream the corpus (one tweet per line) and train a skip-gram model
sentences = gensim.models.word2vec.LineSentence('/.../data/tweets_26M.txt')
model = gensim.models.word2vec.Word2Vec(sentences, window=2, sg=1, size=200, iter=20)
# Export the trained vectors in the binary word2vec format
model.save_word2vec_format('/.../savedModel/Tweets26M_All.model.bin', binary=True)
However, I am continuously collecting more tweets in my database. For example, when I have 2 million more tweets, I want to update my embeddings so that they also take these newly arriving 2M tweets into account.
Is it possible to load the previously trained model and update the embedding weights (perhaps adding embeddings for new words to the model)? Or do I need to retrain on all 28 (26 + 2) million tweets from the beginning? Training already takes 5 hours with the current parameters and will take even longer as the data grows.
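Is something like the following the intended way to do it? This is only a sketch, assuming gensim 1.0 or later, and assuming the full model was saved with model.save() rather than save_word2vec_format(), which keeps only the final vectors and (as far as I understand) cannot be trained further. The file names below are placeholders:

import gensim

# Load the full model (not the vectors-only .bin export)
model = gensim.models.word2vec.Word2Vec.load('/.../savedModel/Tweets26M_All.model')

new_sentences = gensim.models.word2vec.LineSentence('/.../data/tweets_2M_new.txt')
# Add newly seen words to the existing vocabulary, then continue training
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.iter)
model.save('/.../savedModel/Tweets28M_All.model')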
One other question: is it possible to feed the sentences parameter directly from a database (instead of reading it from txt, bz2, or gz files)? As the training data gets bigger, it would be better to bypass the text read/write operations.
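Since Word2Vec appears to accept any restartable iterable of token lists, not just LineSentence, I was thinking of wrapping the database query in an iterable class along these lines. Would that work? A sketch using sqlite3; the database file, table, and column names are made up:

import sqlite3
import gensim

class DbTweets:
    """Restartable iterable that streams tokenized tweets from a database.

    Word2Vec iterates over its corpus several times (a vocabulary scan plus
    one pass per epoch), so a one-shot generator is not enough: __iter__
    must be able to re-run the query each time.
    """
    def __init__(self, db_path):
        self.db_path = db_path

    def __iter__(self):
        conn = sqlite3.connect(self.db_path)
        try:
            # Table and column names are hypothetical placeholders
            for (text,) in conn.execute('SELECT text FROM tweets'):
                yield text.split()  # whitespace tokenization, like LineSentence
        finally:
            conn.close()

sentences = DbTweets('/.../data/tweets.db')
model = gensim.models.word2vec.Word2Vec(sentences, window=2, sg=1, size=200, iter=20)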
Source: https://stackoverflow.com/questions/40727093/gensim-word2vec-updating-word-embeddings-with-newcoming-data