word2vec - KeyError: “word X not in vocabulary”

Posted by 大城市里の小女人 on 2020-01-25 08:22:27

Question


I am using the Word2Vec implementation from the gensim module to build word embeddings for sentences stored in a plain text file. Even though the word happy is present in my data, I get the error KeyError: "word 'happy' not in vocabulary". I tried the answers given to a similar question, but they did not work, so I am posting my own question.

Here is the code:

try:
    data = []
    with open(TXT_PATH, 'r', encoding='utf-8') as txt_file:
        for line in txt_file:
            for part in line.split(' '):
                data.append(part.strip())

    # When I debug, both of the words 'happy' and 'birthday' exist in the variable 'data'
    word2vec = Word2Vec(data, min_count=5, size=10000, window=5, workers=4)

    # Print result
    word_1 = 'happy'
    word_2 = 'birthday'
    print(f'Similarity between {word_1} and {word_2} thru word2vec: {word2vec.similarity(word_1, word_2)}')
except Exception as err:
    print(f'An error happened! Detail: {str(err)}')

Answer 1:


When you get a "not in vocabulary" error like this from Word2Vec, you can trust it: 'happy' really isn't in the model.

Even if your visual check shows 'happy' inside your file, a few reasons why it might not wind up inside the model include:

  • it doesn't occur at least min_count=5 times

  • the data format isn't correct for Word2Vec, so it's not seeing the words you expect it to see.
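The first possibility can be checked without gensim by counting tokens directly. This is a minimal sketch that assumes whitespace-separated text; count_tokens and sample_lines are hypothetical names used only for illustration and are not part of gensim:

```python
from collections import Counter


def count_tokens(lines):
    """Count how often each whitespace-separated token occurs (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Illustrative stand-in for the lines of the real text file.
sample_lines = ["happy birthday to you", "happy new year"]
counts = count_tokens(sample_lines)

min_count = 5  # same threshold as in the question
print(counts["happy"], counts["happy"] >= min_count)
```

If the count printed for a word is below min_count, Word2Vec will drop that word no matter how the data is formatted.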

Looking at how data is prepared by your code, it looks like a giant list of all words in your file. Word2Vec instead expects a sequence that has, as each item, a list-of-words for that one text. So: not a list-of-words, but a list where each item is a list-of-words.

If you've supplied...

[
  'happy',
  'birthday',
]

...instead of the expected...

[
  ['happy', 'birthday',],
]

...those single-word-strings will be seen as lists-of-characters, so Word2Vec will think you want to learn word-vectors for a bunch of one-character words. You can check if this has affected your model by seeing if the vocabulary size seems small (len(model.wv)) or if a sample of learned-words is only single-character words (model.wv.index2entity[:10]; in gensim 4.0 and later this list is named model.wv.index_to_key).
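The effect can be reproduced in plain Python. The sketch below only mimics how any consumer that iterates over "each token of each text" would see the two shapes; it is not gensim's actual vocabulary code, and surveyed_tokens is a hypothetical name:

```python
from collections import Counter

flat = ["happy", "birthday"]      # shape built by the question's code
nested = [["happy", "birthday"]]  # shape Word2Vec expects


def surveyed_tokens(corpus):
    """Count tokens by iterating over each item of each text (hypothetical helper)."""
    counts = Counter()
    for text in corpus:
        for token in text:  # iterating over a str yields its characters
            counts[token] += 1
    return counts


print(sorted(surveyed_tokens(flat)))    # single characters only
print(sorted(surveyed_tokens(nested)))  # whole words
```

With the flat list, 'happy' never appears as a token at all, which matches the "not in vocabulary" error.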

If you supply a word in the right format, at least min_count times, as part of the training-data, it will wind up with a vector in the model.
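One way the data preparation could be restructured is sketched below. It assumes each line of the file is one text and that gensim 4.x is installed (gensim 3.x, as used in the question, takes size= instead of vector_size=). read_sentences and train are hypothetical helper names, and vector_size=100 is just a value inside the usual range:

```python
def read_sentences(path):
    """Return one list of tokens per non-empty line of the file."""
    sentences = []
    with open(path, "r", encoding="utf-8") as txt_file:
        for line in txt_file:
            tokens = line.split()  # splits on any whitespace, no empty strings
            if tokens:
                sentences.append(tokens)
    return sentences


def train(sentences):
    """Train a Word2Vec model on a list of token lists (assumes gensim 4.x)."""
    # Imported here so the data-preparation part runs without gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec(
        sentences=sentences,
        vector_size=100,  # gensim 3.x: size=100
        window=5,
        min_count=5,
        workers=4,
    )
```

Usage would then look like model = train(read_sentences(TXT_PATH)) followed by model.wv.similarity('happy', 'birthday'), which still raises the KeyError if either word occurs fewer than min_count times.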

(Separately: size=10000 is a choice way outside the usual range of 100-400. I've never seen a project using such high-dimensionality for word-vectors, and it would only be theoretically justifiable if you had a massively-large vocabulary and training-set. Oversized vectors with smaller vocabularies/data are likely to create uselessly overfit results.)



Source: https://stackoverflow.com/questions/58666699/word2vec-keyerror-word-x-not-in-vocabulary
