Different models with gensim Word2Vec on python

我们两清 posted on 2021-01-28 14:02:40

Question


I am trying to apply the word2vec model implemented in the gensim library in Python. I have a list of sentences (each sentence is a list of words).

For instance let us have:

sentences=[['first','second','third','fourth']]*n

and I train two models with identical parameters:

model = gensim.models.Word2Vec(sentences, min_count=1, size=2)
model2 = gensim.models.Word2Vec(sentences, min_count=1, size=2)

I noticed that the models are sometimes the same and sometimes different, depending on the value of n.

For instance, if n=100 I obtain

print(model['first']==model2['first'])
True

while, for n=1000:

print(model['first']==model2['first'])
False

How is this possible?

Thank you very much!


Answer 1:


Looking at the gensim documentation, there is some randomization when you run Word2Vec:

seed = for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling.
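To make the seeding idea in that passage concrete, here is a minimal standard-library sketch of deriving a word's initial vector from a hash of word + str(seed). It is only an illustration, not gensim's actual code: the function name seeded_vector, the use of zlib.crc32 and the scaling are all assumptions chosen for the example.

```python
import random
import zlib


def seeded_vector(word, seed, size):
    """Toy per-word initial vector derived from a hash of word + str(seed)."""
    # zlib.crc32 is used rather than the built-in hash(), because hash() on
    # strings changes between interpreter launches unless PYTHONHASHSEED is set.
    key = zlib.crc32((word + str(seed)).encode("utf-8"))
    rng = random.Random(key)
    # Small values centred on zero, scaled by the vector size (an assumption).
    return [(rng.random() - 0.5) / size for _ in range(size)]


same_a = seeded_vector("first", 1234, 2)
same_b = seeded_vector("first", 1234, 2)
other = seeded_vector("first", 4321, 2)

print(same_a == same_b)  # same word and seed: identical starting vectors
print(same_a == other)   # different seed: different starting vectors
```

With a scheme like this, the starting point of each word depends only on the word and the seed, so two runs begin from the same vectors whenever the seed is the same.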

Thus if you want to have reproducible results, you will need to set the seed:

In [1]: import gensim

In [2]: sentences=[['first','second','third','fourth']]*1000

In [3]: model1 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2)

In [4]: model2 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2)

In [5]: print(all(model1['first']==model2['first']))
False

In [6]: model3 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2, seed = 1234)

In [7]: model4 = gensim.models.Word2Vec(sentences, min_count = 1, size = 2, seed = 1234)

In [11]: print(all(model3['first']==model4['first']))
True
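The documentation quoted above also says that a fully reproducible run needs a single worker thread (the workers=1 argument of Word2Vec). The toy sketch below suggests why ordering matters: when each update depends on the current value of a weight, applying the same updates in a different order ends at a different value. It does not reproduce gensim's training; apply_updates, the learning rate and the targets are hypothetical names and numbers picked for illustration.

```python
def apply_updates(start, targets, lr=0.5):
    """Apply simple updates in sequence; each step moves w toward a target."""
    w = start
    for t in targets:
        # The size of each step depends on the current w, so order matters.
        w = w + lr * (t - w)
    return w


# The same two updates, applied in the two possible orders.
order_a = apply_updates(0.0, [1.0, 3.0])
order_b = apply_updates(0.0, [3.0, 1.0])

print(order_a)  # 1.75
print(order_b)  # 1.25
```

With several worker threads the order in which batches reach the shared weights is up to the OS scheduler, so two runs with the same seed can still diverge in this way.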


Source: https://stackoverflow.com/questions/37745250/different-models-with-gensim-word2vec-on-python
