difflib.SequenceMatcher isjunk argument not considered?

Submitted by 喜你入骨 on 2019-12-12 13:08:10

Question


In Python's difflib library, is the SequenceMatcher class behaving unexpectedly, or am I misreading what the intended behavior is?

Why does the isjunk argument seem to not make any difference in this case?

difflib.SequenceMatcher(None, "AA", "A A").ratio() returns 0.8

difflib.SequenceMatcher(lambda x: x in ' ', "AA", "A A").ratio() returns 0.8

My understanding is that if the space is ignored, the ratio should be 1.


Answer 1:


This happens because ratio() divides by the total length of both sequences as they were passed in; it does not remove the elements that isjunk flags before counting. isjunk only influences which matching blocks are found. So, as long as the matching blocks add up to the same number of matched elements with and without isjunk, the ratio will be the same.

I assume that the sequences are not filtered by isjunk for performance reasons.

def ratio(self):   
    """Return a measure of the sequences' similarity (float in [0,1]).

    Where T is the total number of elements in both sequences, and
    M is the number of matches, this is 2.0*M / T.
    """

    matches = sum(triple[-1] for triple in self.get_matching_blocks())
    return _calculate_ratio(matches, len(self.a) + len(self.b))

self.a and self.b are the strings (sequences) passed to the SequenceMatcher object ("AA" and "A A" in your example). The isjunk function lambda x: x in ' ' is only used to determine the matching blocks. Your example is quite simple, so the resulting ratio and matching blocks are the same for both calls.
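To make the formula concrete, here is a minimal sketch that recomputes the ratio by hand from the matching blocks. The helper name manual_ratio is hypothetical (it is not part of difflib); it only mirrors the 2.0*M / T formula from the docstring above, and it assumes the total length T always comes from the unfiltered inputs.

```python
import difflib


def manual_ratio(a, b, isjunk=None):
    """Recompute SequenceMatcher.ratio() by hand (hypothetical helper)."""
    matcher = difflib.SequenceMatcher(isjunk, a, b)
    # M: total size of all matching blocks (the final sentinel block has size 0).
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: length of the *unfiltered* inputs; isjunk plays no part here.
    total = len(a) + len(b)
    return 2.0 * matches / total if total else 1.0


# Same value with or without isjunk for the example from the question.
print(manual_ratio("AA", "A A"))
print(manual_ratio("AA", "A A", lambda x: x == " "))
```

Both calls should agree with SequenceMatcher(...).ratio() for the same arguments, because the junk function never changes T.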

difflib.SequenceMatcher(None, "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=2, b=3, size=0)]

The matching blocks are the same, so the ratio is too: M = 2, T = 5 => ratio = 2.0 * 2 / 5 = 0.8

Now consider the following example:

difflib.SequenceMatcher(None, "AA ", "A A").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=3, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=1), Match(a=3, b=3, size=0)]

Now matching blocks are different, but the ratio will be the same because the number of matches is still equal:

When isjunk is None: M = 2, T = 6 => ratio = 2.0 * 2 / 6

When isjunk is lambda x: x == ' ': M = 1 + 1, T = 6 => ratio = 2.0 * 2 / 6

Finally, a different number of matches:

difflib.SequenceMatcher(None, "AA ", "A A ").get_matching_blocks()
[Match(a=1, b=0, size=2), Match(a=3, b=4, size=0)]

difflib.SequenceMatcher(lambda x: x == ' ', "AA ", "A A ").get_matching_blocks()
[Match(a=0, b=0, size=1), Match(a=1, b=2, size=2), Match(a=3, b=4, size=0)]

The number of matches is different, so the ratios differ:

When isjunk is None: M = 2, T = 7 => ratio = 2.0 * 2 / 7

When isjunk is lambda x: x == ' ': M = 1 + 2, T = 7 => ratio = 2.0 * 3 / 7
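If what you actually want is the behavior the question expected (a ratio of 1 when spaces are ignored), one option is to drop the junk elements yourself before comparing. The following is a minimal sketch, not something difflib provides: ratio_ignoring is a hypothetical helper, and it assumes you are happy to discard the junk elements entirely rather than just deprioritize them during matching.

```python
import difflib


def ratio_ignoring(a, b, isjunk):
    """Similarity ratio after removing junk elements (hypothetical helper)."""
    # Filter both inputs first so junk counts toward neither M nor T.
    a_kept = [x for x in a if not isjunk(x)]
    b_kept = [x for x in b if not isjunk(x)]
    return difflib.SequenceMatcher(None, a_kept, b_kept).ratio()


is_space = lambda x: x == " "
print(ratio_ignoring("AA", "A A", is_space))  # 1.0
```

Because both filtered sequences become ['A', 'A'], M = 2 and T = 4, which gives 2.0 * 2 / 4 = 1.0.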



Source: https://stackoverflow.com/questions/38129357/difflib-sequencematcher-isjunk-argument-not-considered
