Question
I am currently using the Word2Vec model trained on the Google News corpus (from here). Since this is trained only on news up to 2013, I need to update the vectors and also add new words to the vocabulary based on news published after 2013.
Suppose I have a new corpus of news published after 2013. Can I re-train, fine-tune, or update the Google News Word2Vec model? Can it be done using Gensim? Can it be done using FastText?
Answer 1:
You can have a look at this: https://github.com/facebookresearch/fastText/pull/423
It does exactly what you want. Here is what the pull request says:
Training the classification model or word vector model incrementally.
./fasttext [supervised | skipgram | cbow] -input train.data -inputModel trained.model.bin -output re-trained [other options] -incr
-incr stands for incremental training.
When training word embeddings, one could either train from scratch on all the data each time, or train just on the new data. For classification, one could train from scratch with pre-trained word embeddings using all the data, or using only the new data, while leaving the word embeddings unchanged.
Incremental training means that, having finished training the model on the data we had before, we retrain it on the newer data we receive, rather than starting from scratch.
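Note that the -incr flag comes from that pull request, so it is only available if you build that branch of fastText. A merged alternative is Gensim's FastText implementation, which can load a model saved in fastText's native .bin format and continue training it. Below is a minimal sketch assuming gensim 4.x; the file paths and the toy corpus are placeholders, and keep in mind that the GoogleNews file is plain word2vec vectors, not a fastText model, so this route needs a fastText-format model:

from gensim.models.fasttext import load_facebook_model

# Load a model saved in fastText's native .bin format
# ("news-2013.bin" is a placeholder path, not from the answer).
model = load_facebook_model("news-2013.bin")

# Placeholder post-2013 corpus: an iterable of tokenized sentences.
new_sentences = [
    ["markets", "reacted", "to", "the", "election"],
    ["the", "company", "launched", "a", "new", "smartphone"],
]

# Add previously unseen words to the vocabulary, then keep training.
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)

model.save("news-updated.model")  # save in gensim's own format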
Answer 2:
Yes, you can. I have been working on this recently too.
- word2vec Reference
- GloVe Reference
Edit: GloVe has the overhead of computing and storing the co-occurrence matrix in memory while training, whereas training word2vec is comparatively easy.
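Since the question also asks about Gensim: a commonly cited recipe builds a fresh gensim Word2Vec model on the new corpus, seeds the words it shares with GoogleNews via intersect_word2vec_format, and then trains. Below is a minimal sketch assuming gensim 4.x (where the per-word lock array has to be expanded by hand before the call); the corpus here is a placeholder, and gensim's maintainers describe this kind of fine-tuning as experimental, since only words present in both vocabularies receive pre-trained vectors:

import numpy as np
from gensim.models import Word2Vec

# Placeholder post-2013 corpus: an iterable of tokenized sentences.
new_sentences = [
    ["stocks", "rallied", "after", "the", "announcement"],
    ["the", "startup", "released", "its", "new", "app"],
]

model = Word2Vec(vector_size=300, min_count=1)  # 300 dims to match GoogleNews
model.build_vocab(new_sentences)

# gensim 4.x stores lock factors in a one-element array by default;
# expand it so intersect_word2vec_format can set per-word lock factors.
model.wv.vectors_lockf = np.ones(len(model.wv), dtype=np.float32)

# Seed words that also appear in GoogleNews with the pre-trained vectors;
# lockf=1.0 leaves them free to keep moving during training.
model.wv.intersect_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0)

model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)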
Source: https://stackoverflow.com/questions/46244286/fine-tuning-pre-trained-word2vec-google-news