How to correct bugs in this Damerau-Levenshtein implementation?

后端 未结 1 406
礼貌的吻别
礼貌的吻别 2021-01-03 10:55

I\'m back with another longish question. Having experimented with a number of Python-based Damerau-Levenshtein edit distance implementations, I finally found the one listed

相关标签:
1条回答
  • 2021-01-03 11:42

    Do some elementary debugging. You know that it is going wrong in the 2nd output line marked #ED B. The wrong values seem to indicate that it finds one edit early on and never finds any more. This is possibly because one of the min() args is somehow clamped at 1. Print deletion_cost, substitution_cost, addition_cost ... which is wrong? Why is it wrong? Print the input text values. Temporarily disable the transposition section to see if that makes the problem go away. Check and re-check the _warp caper (a tricksy hobbit gimmick if I ever saw one) and the usage thereof. What happens if you compare "aaaaa" with "aaaaa"? "qwerty" with "qwerty"? "xxxxx" with "yyyyy"? Does the problem happen with all of bytes, bytearray and str input?

    The free problem: I'd suspect corruption, not dizzyness. Print the three arrays; are their contents as expected? Try enabling the free() one array at a time -- all broken? only one? which one?

    Some asides on memory management: You may like to read this and consider using the Python-specific routines instead of malloc/free. Downsizing your array if there have been surrogates seems over the top.

    Update: Followed my own suggestions. Deletion cost was stuffed. "oneago" was same as "thisrow". Problem causing both the wrong answer and the doubled (-! not corrupted !-) free: circular shuffle of pointers wasn't circular.

    # twoago, oneago = oneago, thisrow ### BUG ###
    twoago, oneago, thisrow = oneago, thisrow, twoago ### FIXED ###
    

    Update 2: [comment capacity too small] No mojo, just plain ordinary debugging spadework, as I suggested. "concentrating on this for my fix" is not "super-readible". The reference code does create a new list for each pass, which it CAN do because thisrow refers to nothing carried over from the previous pass. It doesn't NEED to do this, and in fact the initialisation apart from the first and last elements could consist of random numbers, and are only there to fill out the list so that it can be indexed into instead of appended to as some non-tricksy implementations do. So you can slavishly emulate the "reference implementation", at the cost of doing an extra (wasted) malloc/free, or you could ignore the Python-specific implementation details and use the reference implementation solely as a source of presumably correct answers. Then you could accept my fix, and later go on to saving time by chopping out most of the initialisation of the thisrow array.

    Update 3: Here's a replacement reference implementation for you. It allocates 3 rows initially, in order to avoid the overhead of list creation inside the outer loop. It also avoids the unnecessary initialisation of all but the last element of thisrow. This eases the translation into C/Cython.

    def damlevref2(seq1, seq2):
        # For Python 2.x as was the original.
        # Appears to work on Python 1.5.2 as well :-)
        seq2len = len(seq2)
        twoago = [-777] * (seq2len + 1) # pseudo-malloc; any old rubbish will do
        oneago = [-666] * (seq2len + 1) # ditto
        thisrow = range(1, seq2len + 1) + [0]
        for x in xrange(len(seq1)):
            twoago, oneago, thisrow = oneago, thisrow, twoago # circular "pointer" shuffle
            thisrow[-1] = x + 1
            for y in xrange(seq2len):
                delcost = oneago[y] + 1
                addcost = thisrow[y - 1] + 1
                subcost = oneago[y - 1] + (seq1[x] != seq2[y])
                thisrow[y] = min(delcost, addcost, subcost)
                if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                    and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                    thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
        return thisrow[seq2len - 1]    
    
    0 讨论(0)
自定义标题
段落格式
字体
字号
代码语言
提交回复
热议问题