Python NLTK WUP Similarity Score not unity for exact same word

Question

Simple code like the following gives a similarity score of 0.75 in both cases. As you can see, both words are exactly the same. To avoid any confusion I also compared a word with itself. The score refuses to budge from 0.75. What is going on here?

from nltk.corpus import wordnet as wn
actual = wn.synsets('orange')[0]
predicted = wn.synsets('orange')[0]
similarity = actual.wup_similarity(predicted)
print(similarity)
similarity = actual.wup_similarity(actual)
print(similarity)

Answer 1:

This is an interesting problem.

TL;DR:

Sorry there's no short answer to this problem =(

Too long, want to read:

Looking at the code for wup_similarity(), the problem comes not from the similarity calculation but from the way NLTK traverses the WordNet hierarchy to get the lowest_common_hypernyms() (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805).

Normally, the lowest common hypernym of a synset and itself should be the synset itself:

>>> from nltk.corpus import wordnet as wn
>>> y = wn.synsets('car')[0]
>>> y.lowest_common_hypernyms(y, use_min_depth=True)
[Synset('car.n.01')]

But in the case of orange it gives fruit too:

>>> from nltk.corpus import wordnet as wn
>>> x = wn.synsets('orange')[0]
>>> x.lowest_common_hypernyms(x, use_min_depth=True)
[Synset('fruit.n.01'), Synset('orange.n.01')]

We'll have to look at the code for lowest_common_hypernyms(). From its docstring at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805:

Get a list of lowest synset(s) that both synsets have as a hypernym. When use_min_depth == False this means that the synset which appears as a hypernym of both self and other with the lowest maximum depth is returned, or if there are multiple such synsets at the same depth they are all returned.

However, if use_min_depth == True then the synset(s) which has/have the lowest minimum depth and appear(s) in both paths is/are returned.

So let's try lowest_common_hypernyms() with use_min_depth=False:

>>> x.lowest_common_hypernyms(x, use_min_depth=False)
[Synset('orange.n.01')]

Seems like that resolves the ambiguity of the tied path. But the wup_similarity() API doesn't have the use_min_depth parameter:

>>> x.wup_similarity(x, use_min_depth=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: wup_similarity() got an unexpected keyword argument 'use_min_depth'

Note the difference: when use_min_depth == False, lowest_common_hypernyms() compares the candidate synsets by their maximum depth, whereas when use_min_depth == True it compares them by their minimum depth; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L602

So if we trace the lowest_common_hypernyms() code:

>>> synsets_to_search = x.common_hypernyms(x)
>>> synsets_to_search
[Synset('citrus.n.01'), Synset('natural_object.n.01'), Synset('orange.n.01'), Synset('object.n.01'), Synset('plant_organ.n.01'), Synset('edible_fruit.n.01'), Synset('produce.n.01'), Synset('food.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01'), Synset('reproductive_structure.n.01'), Synset('solid.n.01'), Synset('matter.n.03'), Synset('plant_part.n.01'), Synset('fruit.n.01'), Synset('whole.n.02')]

>>> # if use_min_depth == True
>>> max_depth = max(s.min_depth() for s in synsets_to_search)
>>> max_depth
8
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.min_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01'), Synset('fruit.n.01')]
>>> # if use_min_depth == False
>>> max_depth = max(s.max_depth() for s in synsets_to_search)
>>> max_depth
11
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.max_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01')]
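
The same tie can be reproduced without WordNet. The following is a minimal sketch on a toy hierarchy, not real WordNet data: the graph, the HYPERNYMS name, and the helper functions are all made up for illustration. The only property it borrows from the trace above is that one node has two parents, which gives its descendants a short and a long route to the root.

```python
# Toy hypernym graph (child -> list of parents). Hypothetical, NOT real WordNet data.
HYPERNYMS = {
    "entity": [],
    "object": ["entity"],
    "food": ["entity"],
    "plant_part": ["object"],
    "fruit": ["plant_part"],
    "edible_fruit": ["fruit", "food"],  # two parents: a long and a short route to the root
    "orange": ["edible_fruit"],
}


def min_depth(node):
    """Length of the shortest path from node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + min(min_depth(p) for p in parents)


def max_depth(node):
    """Length of the longest path from node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + max(max_depth(p) for p in parents)


def self_and_ancestors(node):
    """The node itself plus everything reachable by following parent links."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(HYPERNYMS[current])
    return seen


def lowest_common(a, b, use_min_depth):
    """Common ancestors with the greatest depth, measured by min or max depth."""
    common = self_and_ancestors(a) & self_and_ancestors(b)
    depth = min_depth if use_min_depth else max_depth
    deepest = max(depth(n) for n in common)
    return sorted(n for n in common if depth(n) == deepest)


print(lowest_common("orange", "orange", use_min_depth=True))   # ['fruit', 'orange']
print(lowest_common("orange", "orange", use_min_depth=False))  # ['orange']
```

Measured by minimum depth, the toy "orange" reaches the root through the short "food" route in 3 steps, the same as "fruit", so the two tie. Measured by maximum depth, "orange" is strictly deeper than any of its ancestors, so the tie disappears.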

This odd behavior of wup_similarity() is actually highlighted in the code comments, https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843

# Note that to preserve behavior from NLTK2 we set use_min_depth=True
# It is possible that more accurate results could be obtained by
# removing this setting and it should be tested later on
subsumers = self.lowest_common_hypernyms(other, simulate_root=simulate_root and need_root, use_min_depth=True)

The first subsumer in the list is then selected at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843:

subsumer = subsumers[0]

Naturally, in the case of the orange synset, fruit is selected because it comes first in the list of tied lowest common hypernyms.
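
To see how that choice turns into 0.75, here is a small arithmetic sketch of the final step of the score. It assumes the formula is 2 * depth / (len1 + len2), where depth is the subsumer's maximum depth plus one and each len is the shortest distance to the subsumer plus depth; that is my reading of the NLTK source, not a quote from it. The function name wup_from_subsumer is hypothetical. The inputs for fruit (maximum depth 8, distance 3 from orange via citrus and edible_fruit) are inferred from the trace above, not measured directly.

```python
def wup_from_subsumer(subsumer_max_depth, dist1, dist2):
    """Wu-Palmer score from a chosen subsumer (hypothetical helper, no NLTK needed).

    subsumer_max_depth: maximum depth of the chosen common hypernym
    dist1, dist2: shortest path distance from each synset to that hypernym
    """
    depth = subsumer_max_depth + 1  # depth counted in nodes rather than edges
    len1 = dist1 + depth
    len2 = dist2 + depth
    return (2.0 * depth) / (len1 + len2)


# Subsumer = fruit: assumed max depth 8, assumed 3 steps up from orange.
print(wup_from_subsumer(8, 3, 3))   # 0.75

# Subsumer = orange itself: max depth 11 (from the trace), distance 0.
print(wup_from_subsumer(11, 0, 0))  # 1.0
```

Under these assumptions, picking fruit as the subsumer reproduces the 0.75 from the question, and picking orange itself gives the expected 1.0.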

To conclude, the default setting is more of a feature than a bug: it is kept to reproduce the behavior of NLTK v2.x.

One solution is to manually change the NLTK source to force use_min_depth=False:

https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L845
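
If editing the installed NLTK source is not an option, roughly the same effect can be had from outside the library. The sketch below is an assumption-laden reimplementation, not NLTK's own code: the function name wup_similarity_max_depth is made up, it uses the same assumed formula as the library (2 * depth over the sum of the two path lengths), and it ignores the simulate_root handling that NLTK applies to synsets without a single root, such as verbs. It relies only on the Synset methods lowest_common_hypernyms(), max_depth() and shortest_path_distance().

```python
def wup_similarity_max_depth(synset1, synset2):
    """Wu-Palmer similarity with the subsumer chosen by maximum depth.

    Hypothetical helper: a sketch for noun synsets only, since it skips
    the simulate_root handling in NLTK's own wup_similarity().
    """
    subsumers = synset1.lowest_common_hypernyms(synset2, use_min_depth=False)
    if not subsumers:
        return None  # no common hypernym, e.g. different parts of speech

    subsumer = subsumers[0]
    depth = subsumer.max_depth() + 1

    len1 = synset1.shortest_path_distance(subsumer)
    len2 = synset2.shortest_path_distance(subsumer)
    if len1 is None or len2 is None:
        return None

    return (2.0 * depth) / (len1 + len2 + 2 * depth)
```

With x = wn.synsets('orange')[0] as above, wup_similarity_max_depth(x, x) should pick orange.n.01 as the only subsumer and so score a synset against itself as 1.0. Scores for other pairs may differ from what wup_similarity() returns, which is the point of the NLTK comment about preserving the old behavior.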

EDITED

Alternatively, you can work around the problem with an ad-hoc check for identical synsets:

def wup_similarity_hacked(synset1, synset2):
  if synset1 == synset2:
    return 1.0
  else:
    return synset1.wup_similarity(synset2)

Source: https://stackoverflow.com/questions/32333996/python-nltk-wup-similarity-score-not-unity-for-exact-same-word

Tags

python

nlp

nltk

similarity