Fine-tuning the pre-trained word2vec Google News model

Submitted by 安稳与你 on 2019-12-01 09:23:03

Question


I am currently using the Word2Vec model trained on the Google News corpus (from here). Since it was trained only on news up to 2013, I need to update the vectors and also add new words to the vocabulary based on news published after 2013.

Suppose I have a new corpus of news from after 2013. Can I re-train, fine-tune, or update the Google News Word2Vec model? Can it be done using Gensim? Can it be done using FastText?
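For context, a typical way to load these pre-trained vectors in Gensim is shown below (the file name is the standard Google News download; this is a sketch, not code from the asker):

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
print(vectors.most_similar("news", topn=3))

Note that vectors loaded this way are query-only; they carry no training state, which is why fine-tuning requires one of the approaches in the answers below.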


Answer 1:


You can have a look at this: https://github.com/facebookresearch/fastText/pull/423

It does exactly what you want. Here is what the link says:

Training the classification model or word vector model incrementally.

./fasttext [supervised | skipgram | cbow] -input train.data -inputModel trained.model.bin -output re-trained [other options] -incr

-incr stands for incremental training.

When training word embeddings, one could either start from scratch with all of the data each time, or train only on the new data. For classification, one could train from scratch with pre-trained word embeddings on all of the data, or on only the new data, leaving the word embeddings unchanged.

Incremental training means that, having finished training the model on the data we had before, we retrain it on the newer data we receive, rather than starting from scratch.
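For example, assuming you build fastText from that pull-request branch (the -incr option never landed in the mainline release, so the stock binary will not accept it), a first training run followed by an incremental update might look like this, with the corpus file names as placeholders:

./fasttext skipgram -input news-2013.txt -output base

./fasttext skipgram -input news-2014.txt -inputModel base.bin -output re-trained -incr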




Answer 2:


Yes, you can. I have been working on this recently as well.

  • word2vec Reference
  • GloVe Reference

Edit: GloVe has the overhead of computing and storing the co-occurrence matrix in memory during training, while training word2vec is comparatively easy.
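To answer the Gensim part of the question: one common approach is to build a vocabulary on the new corpus, seed the overlapping words with the pre-trained Google News vectors, and continue training. A minimal sketch, assuming the Gensim 3.x API (where Word2Vec.intersect_word2vec_format is available) and a placeholder new_sentences corpus:

from gensim.models import Word2Vec

# new_sentences: an iterable of tokenized sentences from the post-2013 corpus
# (tiny placeholder shown here)
new_sentences = [["markets", "rallied", "today"], ["referendum", "vote", "announced"]]

model = Word2Vec(size=300, min_count=1)  # 300 dimensions to match the Google News vectors
model.build_vocab(new_sentences)         # vocabulary comes from the new corpus only

# Overwrite the randomly initialized vectors with the pre-trained ones for
# words present in both vocabularies; lockf=1.0 keeps them trainable
# (lockf=0.0 would freeze them).
model.intersect_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0
)

model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)

Be aware that words absent from the new corpus are dropped, and words that are new start from random vectors, so this approximates incremental training rather than exactly continuing the original Google News run.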



Source: https://stackoverflow.com/questions/46244286/fine-tuning-pre-trained-word2vec-google-news
