Question
Simple code like the following gives a similarity score of 0.75 in both cases. As you can see, both words are exactly the same. To avoid any confusion, I also compared a synset with itself. The score refuses to budge from 0.75. What is going on here?
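from nltk.corpus import wordnet as wn

# Both lookups return the same first synset for 'orange'
actual = wn.synsets('orange')[0]
predicted = wn.synsets('orange')[0]
similarity = actual.wup_similarity(predicted)
print(similarity)
# Comparing the synset with itself gives the same score
similarity = actual.wup_similarity(actual)
print(similarity)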
Answer 1:
This is an interesting problem.
TL;DR:
Sorry there's no short answer to this problem =(
Too long, want to read:
Looking at the code for wup_similarity(), the problem comes not from the similarity calculation but from the way NLTK traverses the WordNet hierarchies in lowest_common_hypernyms() (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805).
Normally, the lowest common hypernym of a synset and itself would have to be the synset itself:
>>> from nltk.corpus import wordnet as wn
>>> y = wn.synsets('car')[0]
>>> y.lowest_common_hypernyms(y, use_min_depth=True)
[Synset('car.n.01')]
But in the case of orange, it gives fruit too:
>>> from nltk.corpus import wordnet as wn
>>> x = wn.synsets('orange')[0]
>>> x.lowest_common_hypernyms(x, use_min_depth=True)
[Synset('fruit.n.01'), Synset('orange.n.01')]
We'll have to take a look at the code for lowest_common_hypernyms(). From the docstring at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805:
Get a list of lowest synset(s) that both synsets have as a hypernym. When use_min_depth == False this means that the synset which appears as a hypernym of both self and other with the lowest maximum depth is returned or if there are multiple such synsets at the same depth they are all returned.
However, if use_min_depth == True then the synset(s) which has/have the lowest minimum depth and appear(s) in both paths is/are returned.
So let's try lowest_common_hypernyms() with use_min_depth=False:
>>> x.lowest_common_hypernyms(x, use_min_depth=False)
[Synset('orange.n.01')]
That seems to resolve the ambiguity of the tied path. But the wup_similarity() API doesn't have a use_min_depth parameter:
>>> x.wup_similarity(x, use_min_depth=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: wup_similarity() got an unexpected keyword argument 'use_min_depth'
Note the difference: when use_min_depth==False, lowest_common_hypernyms() checks for maximum depth while traversing synsets. But when use_min_depth==True, it checks for minimum depth; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L602
So if we trace the lowest_common_hypernyms() code:
>>> synsets_to_search = x.common_hypernyms(x)
>>> synsets_to_search
[Synset('citrus.n.01'), Synset('natural_object.n.01'), Synset('orange.n.01'), Synset('object.n.01'), Synset('plant_organ.n.01'), Synset('edible_fruit.n.01'), Synset('produce.n.01'), Synset('food.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01'), Synset('reproductive_structure.n.01'), Synset('solid.n.01'), Synset('matter.n.03'), Synset('plant_part.n.01'), Synset('fruit.n.01'), Synset('whole.n.02')]
# if use_min_depth==True
>>> max_depth = max(x.min_depth() for x in synsets_to_search)
>>> max_depth
8
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.min_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01'), Synset('fruit.n.01')]
>>>
# if use_min_depth==False
>>> max_depth = max(x.max_depth() for x in synsets_to_search)
>>> max_depth
11
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.max_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01')]
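The tie can be reproduced without WordNet at all. The following is a minimal sketch on a toy hypernym graph; the edges are an assumption, reconstructed by hand from the synset names listed above rather than read from WordNet, and the helper names are made up for illustration. It shows how two hypernym paths of different lengths make min_depth tie where max_depth does not.

```python
# Toy hypernym graph: child -> list of parents.
# ASSUMPTION: edges reconstructed by hand from the synsets listed above,
# not loaded from WordNet. "edible_fruit" has two parents, which creates
# one long path (via "fruit") and one short path (via "produce").
HYPERNYMS = {
    "entity": [],
    "physical_entity": ["entity"],
    "object": ["physical_entity"],
    "whole": ["object"],
    "natural_object": ["whole"],
    "plant_part": ["natural_object"],
    "plant_organ": ["plant_part"],
    "reproductive_structure": ["plant_organ"],
    "fruit": ["reproductive_structure"],
    "matter": ["physical_entity"],
    "solid": ["matter"],
    "food": ["solid"],
    "produce": ["food"],
    "edible_fruit": ["fruit", "produce"],
    "citrus": ["edible_fruit"],
    "orange": ["citrus"],
}


def min_depth(node):
    """Length of the shortest path from node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + min(min_depth(p) for p in parents)


def max_depth(node):
    """Length of the longest path from node up to the root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + max(max_depth(p) for p in parents)


def lowest_common_hypernyms(candidates, use_min_depth):
    """Keep the candidates that are deepest under the chosen depth measure."""
    depth = min_depth if use_min_depth else max_depth
    deepest = max(depth(n) for n in candidates)
    return sorted(n for n in candidates if depth(n) == deepest)


# Comparing "orange" with itself: every node in the toy graph is a candidate.
candidates = list(HYPERNYMS)
tied = lowest_common_hypernyms(candidates, use_min_depth=True)
unique = lowest_common_hypernyms(candidates, use_min_depth=False)
print(tied)    # ['fruit', 'orange']
print(unique)  # ['orange']
```

Under these assumed edges, orange sits 8 steps from the root along its short path and 11 along its long one, while fruit sits at 8 on its only path, so the two tie on minimum depth only.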
This weird phenomenon with wup_similarity() is actually highlighted in the code comments, https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843
# Note that to preserve behavior from NLTK2 we set use_min_depth=True
# It is possible that more accurate results could be obtained by
# removing this setting and it should be tested later on
subsumers = self.lowest_common_hypernyms(other, simulate_root=simulate_root and need_root, use_min_depth=True)
And then the first subsumer in the list is selected at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843:
subsumer = subsumers[0]
Naturally, in the case of the orange synset, fruit is selected because it comes first in the list of tied lowest common hypernyms.
To conclude, the default parameter is more of a feature than a bug: it maintains reproducibility with NLTK v2.x.
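To see why picking fruit gives exactly 0.75, here is a rough sketch of the Wu-Palmer arithmetic as I understand NLTK performs it: subsumer depth plus one, then each synset's path length to the subsumer added on top. The function name wup_score and the concrete numbers are assumptions: a max depth of 8 for fruit and a distance of 3 edges from orange up to fruit (orange -> citrus -> edible_fruit -> fruit) are inferred from the depths shown above, not printed by NLTK here.

```python
def wup_score(subsumer_depth, dist1, dist2):
    """Wu-Palmer score from the subsumer's depth and each synset's distance to it."""
    len1 = dist1 + subsumer_depth
    len2 = dist2 + subsumer_depth
    return (2.0 * subsumer_depth) / (len1 + len2)


# Subsumer is orange itself: distance 0 on both sides.
# Depth is max_depth + 1 = 11 + 1 (the add-one is assumed to mirror NLTK).
same = wup_score(12, 0, 0)

# Subsumer is fruit: ASSUMED max_depth 8 (so depth 9) and 3 edges
# from orange up to fruit on both sides.
via_fruit = wup_score(9, 3, 3)

print(same, via_fruit)  # 1.0 0.75
```

So the formula itself behaves sensibly; the 0.75 comes entirely from measuring against fruit instead of orange.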
So one solution might be to manually change the NLTK source to force use_min_depth=False:
https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L845
EDITED
To resolve the problem, you could possibly do an ad-hoc check for identical synsets:
def wup_similarity_hacked(synset1, synset2):
    # Identical synsets should score 1.0, so short-circuit the tied-subsumer case
    if synset1 == synset2:
        return 1.0
    else:
        return synset1.wup_similarity(synset2)
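Another option that avoids patching NLTK is to redo the calculation outside the library with use_min_depth=False. The following is only a sketch under assumptions: wup_similarity_max_depth is a hypothetical helper, not part of NLTK, it handles noun synsets only (no simulated root), and it relies on the synset methods lowest_common_hypernyms(), max_depth() and shortest_path_distance().

```python
def wup_similarity_max_depth(synset1, synset2):
    """Wu-Palmer score with the subsumer chosen via use_min_depth=False.

    Hypothetical helper, not part of NLTK. Assumes noun synsets, so no
    simulated root is needed.
    """
    subsumers = synset1.lowest_common_hypernyms(synset2, use_min_depth=False)
    if not subsumers:
        return None  # no common hypernym found

    subsumer = subsumers[0]
    depth = subsumer.max_depth() + 1  # add-one assumed to mirror NLTK

    len1 = synset1.shortest_path_distance(subsumer)
    len2 = synset2.shortest_path_distance(subsumer)
    if len1 is None or len2 is None:
        return None

    return (2.0 * depth) / (len1 + len2 + 2 * depth)
```

With NLTK and the WordNet corpus installed, wup_similarity_max_depth(x, x) for x = wn.synsets('orange')[0] should come out as 1.0, since the subsumer is then orange itself. Scores for other pairs may differ from what wup_similarity() returns.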
Source: https://stackoverflow.com/questions/32333996/python-nltk-wup-similarity-score-not-unity-for-exact-same-word