问题
In the python difflib library, is the SequenceMatcher class behaving unexpectedly, or am I misreading what the supposed behavior is?
Why does the isjunk argument seem to not make any difference in this case?
difflib.SequenceMatcher(None, "AA", "A A").ratio() return 0.8
difflib.SequenceMatcher(lambda x: x in ' ', "AA", "A A").ratio() returns 0.8
My understanding is that if space is omitted, the ratio should be 1.
回答1:
This is happening because the ratio
function uses total sequences' length while calculating the ratio, but it doesn't filter elements using isjunk
. So, as long as the number of matches in the matching blocks results in the same value (with and without isjunk
), the ratio measure will be the same.
I assume that sequences are not filtered by isjunk
because of performance reasons.
def ratio(self):
"""Return a measure of the sequences' similarity (float in [0,1]).
Where T is the total number of elements in both sequences, and
M is the number of matches, this is 2.0*M / T.
"""
matches = sum(triple[-1] for triple in self.get_matching_blocks())
return _calculate_ratio(matches, len(self.a) + len(self.b))
self.a
and self.b
are the strings (sequences) passed to the SequenceMatcher object ("AA" and "A A" in your example). The isjunk
function lambda x: x in ' '
is only used to determine the matching blocks. Your example is quite simple, so the resulting ratio and matching blocks are the same for both calls.
difflib.SequenceMatcher(None, "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]
difflib.SequenceMatcher(lambda x: x == ' ', "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]
Same matching blocks, the ratio is: M = 2, T = 6 => ratio = 2.0 * 2 / 6
Now consider the following example:
difflib.SequenceMatcher(None, "AA ", "A A").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=3, size=0)]
difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=3, b=3, size=0)]
Now matching blocks are different, but the ratio will be the same because the number of matches is still equal:
When isjunk
is None: M = 2, T = 6 => ratio = 2.0 * 2 / 6
When isjunk
is lambda x: x == ' '
: M = 1 + 1, T = 6 => ratio = 2.0 * 2 / 6
Finally, a different number of matches:
difflib.SequenceMatcher(None, "AA ", "A A ").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=4, size=0)]
difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A ").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=2), Match(a=3, b=4, size=0)]
The number of matches is different
When isjunk
is None: M = 2, T = 7 => ratio = 2.0 * 2 / 7
When isjunk
is lambda x: x == ' '
: M = 1 + 2, T = 6 => ratio = 2.0 * 3 / 7
来源:https://stackoverflow.com/questions/38129357/difflib-sequencematcher-isjunk-argument-not-considered