Understanding gensim word2vec's most_similar

為{幸葍}努か 提交于 2021-01-28 10:50:30

问题


I am unsure how I should use the most_similar method of gensim's Word2Vec. Let's say you want to test the tried-and-true example of: man stands to king as woman stands to X; find X. I thought that is what you could do with this method, but from the results I am getting I don't think that is true.

The documentation reads:

Find the top-N most similar words. Positive words contribute positively towards the similarity, negative words negatively.

This method computes cosine similarity between a simple mean of the projection weight vectors of the given words and the vectors for each word in the model. The method corresponds to the word-analogy and distance scripts in the original word2vec implementation.

I assume, then, that most_similar takes the positive examples and negative examples, and tries to find points in the vector space that are as close as possible to the positive vectors and as far away as possible from the negative ones. Is that correct?

Additionally, is there a method that allows us to map the relation between two points to another point and get the result (cf. the man-king woman-X example)?


回答1:


You can view exactly what most_similar() does in its source code:

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L485

It's not quite "find points in the vector space that are as close as possible to the positive vectors and as far away as possible from the negative ones". Rather, as described in the original word2vec papers, it performs vector arithmetic: adding the positive vectors, subtracting the negative, then from that resulting position, listing the known-vectors closest to that angle.

That is sufficient to solve man : king :: woman :: ?-style analogies, via a call like:

sims = wordvecs.most_similar(positive=['king', 'woman'], 
                             negative=['man'])

(You can think of this as, "start at 'king'-vector, add 'woman'-vector, subtract 'man'-vector, from where you wind up, report ranked word-vectors closest to that point (while leaving out any of the 3 query vectors).")



来源:https://stackoverflow.com/questions/54580260/understanding-gensim-word2vecs-most-similar

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!