Question
I am using the Word2Vec implementation from the gensim module to construct word embeddings for the sentences I have in a plain-text file. Even though the word happy is present in my data, I get the error KeyError: "word 'happy' not in vocabulary". I tried to apply the answers given to a similar question, but that did not work, so I am posting my own question.
Here is the code:
from gensim.models import Word2Vec

try:
    data = []
    with open(TXT_PATH, 'r', encoding='utf-8') as txt_file:
        for line in txt_file:
            for part in line.split(' '):
                data.append(part.strip())
    # When I debug, both of the words 'happy' and 'birthday' exist in the variable 'data'
    word2vec = Word2Vec(data, min_count=5, size=10000, window=5, workers=4)
    # Print result
    word_1 = 'happy'
    word_2 = 'birthday'
    print(f'Similarity between {word_1} and {word_2} thru word2vec: {word2vec.similarity(word_1, word_2)}')
except Exception as err:
    print(f'An error happened! Detail: {str(err)}')
Answer 1:
When you get a "not in vocabulary" error like this from Word2Vec, you can trust it: 'happy' really isn't in the model. Even if your visual check shows 'happy' inside your file, a few reasons why it might not wind up inside the model include:
- it doesn't occur at least min_count=5 times (a toy demonstration of this follows just after this list)
- the data format isn't correct for Word2Vec, so it's not seeing the words you expect it to see
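For the first reason, a minimal sketch (my own toy corpus, not data from the question) shows how a word that appears fewer than min_count times silently gets no vector:

from gensim.models import Word2Vec

# Toy corpus (hypothetical): 'hello' and 'world' occur 6 times each, 'happy' only twice
toy_sentences = [['hello', 'world', 'happy']] * 2 + [['hello', 'world']] * 4

# With min_count=5, 'happy' is pruned from the vocabulary before training (gensim 3.x API)
model = Word2Vec(toy_sentences, min_count=5, size=10, window=5, workers=1)
print('happy' in model.wv.vocab)  # False: it occurred fewer than min_count times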
Looking at how data is prepared by your code, it looks like one giant list of all the words in your file. Word2Vec instead expects a sequence that has, as each item, a list-of-words for that one text. So: not a list-of-words, but a list where each item is a list-of-words.
If you've supplied...
[
'happy',
'birthday',
]
...instead of the expected...
[
['happy', 'birthday',],
]
...those single-word strings will be seen as lists-of-characters, so Word2Vec will think you want to learn word-vectors for a bunch of one-character words. You can check if this has affected your model by seeing if the vocabulary size seems small (len(model.wv)) or if a sample of the learned words is only single-character words (model.wv.index2entity[:10]).
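For instance, a quick sanity check along those lines might look like this (a sketch assuming the word2vec model object from the question's code and the gensim 3.x API, where index2entity is available):

# Rough diagnostics on the trained model from the question
print(len(word2vec.wv))               # vocabulary size; suspiciously small suggests a data-format problem
print(word2vec.wv.index2entity[:10])  # most-frequent learned "words"; single characters confirm the problem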
If you supply a word in the right format, at least min_count times, as part of the training data, it will wind up with a vector in the model.
(Separately: size=10000 is a choice way outside the usual range of 100-400. I've never seen a project use such high dimensionality for word-vectors, and it would only be theoretically justifiable if you had a massively large vocabulary and training set. Oversized vectors with smaller vocabularies/data are likely to create uselessly overfit results.)
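Putting both points together, here is a minimal corrected sketch. It is my own illustration rather than code from the question or the answer: it reuses the TXT_PATH placeholder, tokenizes each line by whitespace only, and picks size=100 as a more conventional dimensionality.

from gensim.models import Word2Vec

# Build a list of sentences, where each sentence is itself a list of word tokens
sentences = []
with open(TXT_PATH, 'r', encoding='utf-8') as txt_file:
    for line in txt_file:
        tokens = line.split()  # whitespace tokenization; one list-of-words per line
        if tokens:
            sentences.append(tokens)

# Train on the list-of-lists; words seen fewer than min_count times still get no vector
word2vec = Word2Vec(sentences, min_count=5, size=100, window=5, workers=4)

# Query via the model's KeyedVectors, guarding against missing words
if 'happy' in word2vec.wv.vocab and 'birthday' in word2vec.wv.vocab:
    print(word2vec.wv.similarity('happy', 'birthday'))
else:
    print("'happy' or 'birthday' occurred fewer than min_count times")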
Source: https://stackoverflow.com/questions/58666699/word2vec-keyerror-word-x-not-in-vocabulary