How to convert gensim Word2Vec model to FastText model?

[亡魂溺海] submitted on 2019-12-06 02:57:43

FastText can create vectors for subword fragments because it includes those fragments in the initial training on the original corpus. Then, when it encounters an out-of-vocabulary ('OOV') word, it constructs a vector for that word from the fragments it recognizes. For languages with recurring word-root/prefix/suffix patterns, this yields OOV vectors that are better than random guesses.
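
To make the idea of subword fragments concrete, here is a minimal sketch of character n-gram extraction and of composing an OOV vector from known fragments. It is not FastText's actual implementation: the function names and the toy vectors are invented for illustration, and the real library also hashes n-grams into a fixed number of buckets, which this sketch skips.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def oov_vector(word, ngram_vectors, min_n=3, max_n=6):
    """Average the vectors of those n-grams of `word` that appear in `ngram_vectors`.

    Returns None when no n-gram of the word is known.
    """
    known = [ngram_vectors[g] for g in char_ngrams(word, min_n, max_n) if g in ngram_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 2-dimensional n-gram vectors, invented purely for illustration.
toy_ngram_vectors = {"<un": [1.0, 0.0], "ing": [0.0, 1.0], "ng>": [0.0, 3.0]}

print(char_ngrams("cat", 3, 4))  # ['<ca', 'cat', 'at>', '<cat', 'cat>']
print(oov_vector("unzipping", toy_ngram_vectors))
```

The point of the sketch is that the n-gram vectors must already exist before an OOV word can be handled, which is why they have to come out of training.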

However, the FastText process does not extract these subword vectors from final full-word vectors. Thus there's no simple way to turn full-word vectors into a FastText model that also includes subword vectors.

There might be a workable way to approximate the same effect, for example by taking all known words that share a subword fragment and extracting some common average or vector component to assign to that fragment, or by modeling an OOV word as some average of the in-vocabulary words that are a short edit distance from it. But these techniques wouldn't quite be FastText, just vaguely analogous to it, and how well they work, or could be made to work with tweaking, would be an experimental question. So, it's not a matter of grabbing an off-the-shelf library.
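
The two approximations described above could be sketched roughly as follows. This is an untested-in-practice illustration, not an established method: the function names, the `max_distance` threshold, and the toy vectors are all assumptions made up for the example, and plain averaging is only one of many possible ways to combine vectors.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def fragment_vector(fragment, word_vectors):
    """Approximation 1: average the vectors of all known words containing `fragment`."""
    matches = [vec for word, vec in word_vectors.items() if fragment in word]
    return mean_vector(matches) if matches else None


def edit_distance(a, b):
    """Levenshtein distance between strings `a` and `b` (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def neighbor_vector(oov_word, word_vectors, max_distance=2):
    """Approximation 2: average the vectors of known words within `max_distance` edits."""
    near = [vec for word, vec in word_vectors.items()
            if edit_distance(oov_word, word) <= max_distance]
    return mean_vector(near) if near else None


# Toy 2-dimensional full-word vectors, invented purely for illustration.
toy_word_vectors = {
    "walking": [1.0, 0.0],
    "talking": [0.0, 1.0],
    "table": [4.0, 4.0],
}

print(fragment_vector("ing", toy_word_vectors))
print(neighbor_vector("balking", toy_word_vectors, max_distance=1))
```

With a real vocabulary the linear scan over all words would be slow, and whether the resulting vectors are any good would still have to be measured on a downstream task.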

There are a couple of research papers with other OOV-bootstrapping ideas, mentioned in this blog post by Sebastian Ruder.

If you need the FastText OOV functionality, the best-grounded approach would be to train FastText vectors from scratch on the same corpus as was used for your traditional full-word-vectors.

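Here is a code snippet that exports the vectors of an existing Word2Vec model as text:

from gensim.models import Word2Vec

# Load the full trained model, then write its word vectors in word2vec text format.
model = Word2Vec.load(model_name)
model.wv.save_word2vec_format('{}.txt'.format(model_name), binary=False)

Where model_name is the path of the saved, trained Word2Vec model.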

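However, gensim (since 3.2.0) has its own FastText implementation, which can be trained directly:

from gensim.models import FastText

# sentences: an iterable of tokenized sentences (lists of str), supplied by you.
# num_workers: the number of training threads, supplied by you.
model = FastText(sentences, workers=num_workers)
model.wv.save_word2vec_format('{}.txt'.format(model_name), binary=False)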

But you'd still need to save the vectors as a text file (binary=False), because FastText cannot interpret binary word embeddings.
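
For reference, the word2vec text format is simple: a header line with the vocabulary size and the vector dimensionality, followed by one line per word containing the word and its space-separated components. The sketch below writes and reads that layout with the standard library only. The helper names and toy vectors are hypothetical, and gensim's own writer may format the floats differently.

```python
import io


def write_word2vec_text(word_vectors, stream):
    """Write `word_vectors` (dict of word -> list of floats) in word2vec text layout."""
    dim = len(next(iter(word_vectors.values())))
    stream.write("{} {}\n".format(len(word_vectors), dim))  # header: vocab size, dimension
    for word, vec in word_vectors.items():
        stream.write("{} {}\n".format(word, " ".join(repr(x) for x in vec)))


def read_word2vec_text(stream):
    """Read vectors back from word2vec text layout into a dict of word -> list of floats."""
    count, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(count):
        parts = stream.readline().rstrip("\n").split(" ")
        values = [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError("unexpected vector length for word {!r}".format(parts[0]))
        vectors[parts[0]] = values
    return vectors


# Toy 3-dimensional vectors, invented purely for illustration.
toy_vectors = {"king": [0.5, -1.0, 2.0], "queen": [0.25, 0.0, 1.5]}

buffer = io.StringIO()
write_word2vec_text(toy_vectors, buffer)
print(buffer.getvalue())

buffer.seek(0)
round_trip = read_word2vec_text(buffer)
```

Note that this format holds only full-word vectors, so exporting to it does not carry any subword information along.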
