Question
I have trained a Word2Vec model using Python's Gensim library. My tokenized list is shown below. The vocab size is 34, but I am only showing some of the entries:
b = ['let',
'know',
'buy',
'someth',
'featur',
'mashabl',
'might',
'earn',
'affili',
'commiss',
'fifti',
'year',
'ago',
'graduat',
'21yearold',
'dustin',
'hoffman',
'pull',
'asid',
'given',
'one',
'piec',
'unsolicit',
'advic',
'percent',
'buy']
Model
model = gensim.models.Word2Vec(b,min_count=1,size=32)
print(model)
### prints: Word2Vec(vocab=34, size=32, alpha=0.025) ####
If I try to look up one of the words in the list with model['buy'], in order to get similarity scores, I get:
KeyError: "word 'buy' not in vocabulary"
Can you suggest what I am doing wrong, and how I can check the model so that it can then be used with PCA or t-SNE to visualize similar words forming a topic? Thank you.
Answer 1:
The first parameter passed to gensim.models.Word2Vec is an iterable of sentences, and each sentence is itself a list of words. From the docs:
"Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training."
Right now, it treats each word in your list b as a sentence, so it is running Word2Vec on each character in each word rather than on each word in b. Right now you can do:
model = gensim.models.Word2Vec(b,min_count=1,size=32)
print(model['a'])
array([ 7.42487283e-03, -5.65282721e-03, 1.28707094e-02, ... ]
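The character-level behavior can be reproduced without Gensim. The following is a minimal sketch, not Gensim's actual code: build_vocab is a hypothetical helper that only mimics the "iterate over sentences, then over tokens" pattern described above, and the list b is shortened to four entries for illustration.

```python
from collections import Counter

def build_vocab(sentences):
    """Count tokens the way a sentence-iterating trainer would see them."""
    counts = Counter()
    for sentence in sentences:   # every item is treated as one sentence
        for token in sentence:   # iterating over a str yields its characters
            counts[token] += 1
    return counts

b = ['let', 'know', 'buy', 'someth']

flat_vocab = build_vocab(b)      # each string is taken as a sentence
nested_vocab = build_vocab([b])  # one sentence containing four words

print('buy' in flat_vocab, 'b' in flat_vocab)      # False True
print('buy' in nested_vocab, 'b' in nested_vocab)  # True False
```

With the flat list the "vocabulary" consists of single characters, which is why a lookup such as 'a' succeeds while 'buy' raises a KeyError.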
To get it to work for words, simply wrap b in another list so that it is interpreted correctly:
model = gensim.models.Word2Vec([b],min_count=1,size=32)
print(model['buy'])
array([-0.01331611, 0.00496594, -0.00165093, -0.01444992, 0.01393849, ... ]
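If the input sometimes arrives as a flat list of tokens and sometimes as a list of sentences, a small guard can normalize it before training. This is only a sketch under one stated assumption: a flat list of strings is meant to be a single tokenized sentence. The helper name to_sentences is hypothetical and not part of Gensim.

```python
def to_sentences(corpus):
    """Return the corpus as a list of token lists.

    Assumption: a flat list of strings represents ONE tokenized sentence,
    so it is wrapped in an outer list. A list of lists is returned as-is.
    """
    corpus = list(corpus)
    if corpus and all(isinstance(item, str) for item in corpus):
        return [corpus]
    return corpus

b = ['let', 'know', 'buy']
sentences = to_sentences(b)
print(sentences)  # [['let', 'know', 'buy']]

# Hypothetical usage with the call from the question:
# model = gensim.models.Word2Vec(to_sentences(b), min_count=1, size=32)
```

Note that a single sentence is very little training data, so the resulting vectors are useful for checking the lookup, not for judging similarity.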
Answer 2:
According to the docs, you need to pass an iterable of sentences. Whatever you pass to the function is treated as an iterable of sentences, so when you pass a flat list of words, each word is treated as a sentence and a word2vec vector is computed for each character in the corpus.
To avoid that problem, pass the list of words inside another list:
word2vec_model = gensim.models.Word2Vec([b],min_count=1,size=32)
Source: https://stackoverflow.com/questions/45420466/gensim-keyerror-word-not-in-vocabulary