I\'m back with another longish question. Having experimented with a number of Python-based Damerau-Levenshtein edit distance implementations, I finally found the one listed
Do some elementary debugging. You know that it is going wrong in the 2nd output line marked #ED B
. The wrong values seem to indicate that it finds one edit early on and never finds any more. This is possibly because one of the min()
args is somehow clamped at 1. Print deletion_cost
, substitution_cost
, addition_cost
... which is wrong? Why is it wrong? Print the input text values. Temporarily disable the transposition section to see if that makes the problem go away. Check and re-check the _warp
caper (a tricksy hobbit gimmick if I ever saw one) and the usage thereof. What happens if you compare "aaaaa" with "aaaaa"? "qwerty" with "qwerty"? "xxxxx" with "yyyyy"? Does the problem happen with all of bytes
, bytearray
and str
input?
The free problem: I'd suspect corruption, not dizzyness. Print the three arrays; are their contents as expected? Try enabling the free()
one array at a time -- all broken? only one? which one?
Some asides on memory management: You may like to read this and consider using the Python-specific routines instead of malloc/free. Downsizing your array if there have been surrogates seems over the top.
Update: Followed my own suggestions. Deletion cost was stuffed. "oneago" was same as "thisrow". Problem causing both the wrong answer and the doubled (-! not corrupted !-) free: circular shuffle of pointers wasn't circular.
# twoago, oneago = oneago, thisrow ### BUG ###
twoago, oneago, thisrow = oneago, thisrow, twoago ### FIXED ###
Update 2: [comment capacity too small] No mojo, just plain ordinary debugging spadework, as I suggested. "concentrating on this for my fix" is not "super-readible". The reference code does create a new list for each pass, which it CAN do because thisrow
refers to nothing carried over from the previous pass. It doesn't NEED to do this, and in fact the initialisation apart from the first and last elements could consist of random numbers, and are only there to fill out the list so that it can be indexed into instead of appended to as some non-tricksy implementations do. So you can slavishly emulate the "reference implementation", at the cost of doing an extra (wasted) malloc/free, or you could ignore the Python-specific implementation details and use the reference implementation solely as a source of presumably correct answers. Then you could accept my fix, and later go on to saving time by chopping out most of the initialisation of the thisrow
array.
Update 3: Here's a replacement reference implementation for you. It allocates 3 rows initially, in order to avoid the overhead of list creation inside the outer loop. It also avoids the unnecessary initialisation of all but the last element of thisrow
. This eases the translation into C/Cython.
def damlevref2(seq1, seq2):
# For Python 2.x as was the original.
# Appears to work on Python 1.5.2 as well :-)
seq2len = len(seq2)
twoago = [-777] * (seq2len + 1) # pseudo-malloc; any old rubbish will do
oneago = [-666] * (seq2len + 1) # ditto
thisrow = range(1, seq2len + 1) + [0]
for x in xrange(len(seq1)):
twoago, oneago, thisrow = oneago, thisrow, twoago # circular "pointer" shuffle
thisrow[-1] = x + 1
for y in xrange(seq2len):
delcost = oneago[y] + 1
addcost = thisrow[y - 1] + 1
subcost = oneago[y - 1] + (seq1[x] != seq2[y])
thisrow[y] = min(delcost, addcost, subcost)
if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
return thisrow[seq2len - 1]