Gensim: KeyError: “word not in vocabulary”

佐手、 提交于 2019-12-03 11:43:11

问题


I have a trained Word2vec model using Python's Gensim Library. I have a tokenized list as below. The vocab size is 34 but I am just giving few out of 34:

b = ['let',
 'know',
 'buy',
 'someth',
 'featur',
 'mashabl',
 'might',
 'earn',
 'affili',
 'commiss',
 'fifti',
 'year',
 'ago',
 'graduat',
 '21yearold',
 'dustin',
 'hoffman',
 'pull',
 'asid',
 'given',
 'one',
 'piec',
 'unsolicit',
 'advic',
 'percent',
 'buy']

Model

model = gensim.models.Word2Vec(b,min_count=1,size=32)
print(model) 
### prints: Word2Vec(vocab=34, size=32, alpha=0.025) ####

if I try to get the similarity score by doing model['buy'] of one the words in the list, I get the

KeyError: "word 'buy' not in vocabulary"

Can you guys suggest me what I am doing wrong and what are the ways to check the model which can be further used to train PCA or t-sne in order to visualize similar words forming a topic? Thank you.


回答1:


The first parameter passed to gensim.models.Word2Vec is an iterable of sentences. Sentences themselves are a list of words. From the docs:

Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.

Right now, it thinks that each word in your list b is a sentence and so it is doing Word2Vec for each character in each word, as opposed to each word in your b. Right now you can do:

model = gensim.models.Word2Vec(b,min_count=1,size=32)

print(model['a'])
array([  7.42487283e-03,  -5.65282721e-03,   1.28707094e-02, ... ]

To get it to work for words, simply wrap b in another list so that it is interpreted correctly:

model = gensim.models.Word2Vec([b],min_count=1,size=32)

print(model['buy'])
array([-0.01331611,  0.00496594, -0.00165093, -0.01444992,  0.01393849, ... ]



回答2:


From the docs you need to pass iterable sentences so whatever you pass to the function it treats input as a iterable so here you are passing only words so it counts word2vec vector for each in charecter in the whole corpus.

So In order to avoid that problem, pass the list of words inside a list.

word2vec_model = gensim.models.Word2Vec([b],min_count=1,size=32)


来源:https://stackoverflow.com/questions/45420466/gensim-keyerror-word-not-in-vocabulary

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!