How to specify a word vector for OOV terms in spaCy?

没有蜡笔的小新 2021-01-23 17:14

I have a pre-trained word2vec model that I load into spaCy to vectorize new words. Given new text, I call nlp('hi').vector to obtain the vector for t

1 Answer
  • 2021-01-23 17:47

    If you simply want your own plug vector instead of spaCy's default all-zeros vector, you can add an extra step that replaces any all-zeros vector with yours. For example:

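    words = ['words', 'may', 'by', 'fehlt']
    my_oov_vec = ...  # whatever you like
    # nlp(word) returns a Doc; take its .vector to get a NumPy array
    spacy_vecs = [nlp(word).vector for word in words]
    # an all-zeros vector signals that spaCy had no vector for the word
    fixed_vecs = [vec if vec.any() else my_oov_vec
                  for vec in spacy_vecs]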
    

    I'm not sure why you'd want to do this. A lot of work with word vectors simply skips out-of-vocabulary words; using any plug value, including spaCy's zero vector, may just add unhelpful noise.
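
    As a rough sketch of the "skip OOV words" approach: the snippet below averages only the vectors of known words. The dict `toy_vectors` and the helper `mean_vector_skipping_oov` are hypothetical stand-ins for illustration, not part of spaCy; with a real spaCy pipeline, the membership test could presumably be based on a token attribute such as `token.has_vector` instead.

```python
# Toy stand-in for a word-vector model: a plain dict from word to vector.
# (Hypothetical data; a real model would supply these vectors.)
toy_vectors = {
    "words": [1.0, 0.0],
    "may": [0.0, 1.0],
    "by": [1.0, 1.0],
}


def mean_vector_skipping_oov(words, vectors):
    """Average the vectors of known words, ignoring OOV words entirely."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None  # no in-vocabulary words; the caller decides what to do
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# 'fehlt' is not in the toy vocabulary, so it does not affect the average.
result = mean_vector_skipping_oov(["words", "may", "by", "fehlt"], toy_vectors)
```

    Returning `None` when every word is OOV is just one possible choice here; it forces the caller to handle that case explicitly rather than silently receiving a zero vector.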

    And if better handling of OOV words is important, note that some other word-vector models, like FastText, can synthesize better-than-nothing guess-vectors for OOV words, by using vectors learned for subword fragments during training. That's similar to how people can often work out the gist of a word from familiar word-roots.
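
    To make the subword idea concrete, here is a toy illustration, not FastText's actual implementation: a word is wrapped in boundary markers, split into character n-grams, and the vectors of whichever n-grams are known get averaged. The helpers `char_ngrams` and `guess_vector` and the data in `toy_ngram_vectors` are all made up for this sketch.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def guess_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's known n-grams; zeros if none are known."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical n-gram vectors, as if learned from related in-vocabulary words.
toy_ngram_vectors = {"<fe": [1.0, 0.0], "feh": [0.0, 1.0], "ehl": [1.0, 1.0]}

# 'fehlt' was never seen as a whole word, but three of its n-grams are known.
guess = guess_vector("fehlt", toy_ngram_vectors, dim=2)
```

    In this sketch an OOV word that shares fragments with known words still gets a non-zero guess, while a word with no known fragments falls back to zeros.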
