Question
I always get a lot of help from Stack Overflow. Thank you, as always.
I am doing simple natural language processing using spaCy.
I'm working on filtering out words by measuring the similarity between words.
I wrote and ran the simple code shown in the spaCy documentation, but the results don't match the documentation.
import spacy

nlp = spacy.load('en_core_web_lg')
tokens = nlp('dog cat banana')
for token1 in tokens:
    for token2 in tokens:
        # pairwise similarity between all tokens
        sim = token1.similarity(token2)
        print("{:>6s}, {:>6s}: {}".format(token1.text, token2.text, sim))
The result of the code is below.
dog, dog: 1.0
dog, cat: 2.307269867164827e-21
dog, banana: 0.0
cat, dog: 2.307269867164827e-21
cat, cat: 1.0
cat, banana: -0.04468117654323578
banana, dog: -7.828739256116838e+17
banana, cat: -8.242222286053048e+17
banana, banana: 1.0
In particular, the similarity between "dog" and "cat" should be about 0.8, but instead it is a very small value.
In addition, the similarity between "dog" and "banana" is 0.0, yet the similarity between "banana" and "dog" is -7.828739256116838e+17.
I don't know how to fix it. Please help me.
Answer 1:
First, install the large EN model (or all models):
python3 -m spacy.en.download all
Next, try the sample code from the documentation using:
nlp = spacy.load('en_core_web_md')
If that doesn't work, instead try loading:
nlp = spacy.load('en')
After making the changes above, the result matches the documentation:
python3 /tmp/c.py
dog, dog: 1.000000078333395
dog, cat: 0.8016855098942641
dog, banana: 0.2432764518408807
cat, dog: 0.8016855098942641
cat, cat: 1.0000001375986456
cat, banana: 0.2815436412709355
banana, dog: 0.2432764518408807
banana, cat: 0.2815436412709355
banana, banana: 1.000000107068369
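If the numbers still look wrong, it is worth checking whether the loaded pipeline actually contains word vectors; tokens without vectors have vector_norm 0.0 and their similarity scores are not meaningful. A minimal check (assuming spaCy 2.x and the en_core_web_md model) could look like this:

import spacy

nlp = spacy.load('en_core_web_md')

# A real vector table has a non-zero shape, e.g. (20000, 300);
# (0, 0) means no vectors were bundled with the model.
print(nlp.vocab.vectors.shape)

doc = nlp('dog cat banana')
for token in doc:
    # has_vector is False and vector_norm is 0.0 when a token has no vector,
    # in which case similarity() is not meaningful
    print(token.text, token.has_vector, token.vector_norm)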
Answer 2:
I finally solved this problem.
Just add the line import numpy as np.
That's all.
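For reference, a minimal sketch of the full script with that import added (assuming the same en_core_web_lg model as in the question):

import numpy as np  # the added import
import spacy

nlp = spacy.load('en_core_web_lg')
tokens = nlp('dog cat banana')
for token1 in tokens:
    for token2 in tokens:
        # pairwise similarity between token vectors
        sim = token1.similarity(token2)
        print("{:>6s}, {:>6s}: {}".format(token1.text, token2.text, sim))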
Source: https://stackoverflow.com/questions/52388291/spacy-similarity-method-doesnt-not-work-correctly