问题
I'm looking to compute the the Levenshtein-distance between sequences containing up to 6 values. The order of these values should not affect the distance.
How would I implement this into the iterative or recursive algorithm?
Example:
# Currently
>>> LDistance('dog', 'god')
2
# Sorted
>>> LDistance('dgo', 'dgo')
0
# Proposed
>>> newLDistance('dog', 'god')
0
'dog' and 'god' have the exact same letters, sorting the strings before hand will return the desired result. However this doesn't work all the time:
# Currently
>>> LDistance('doge', 'gold')
3
# Sorted
>>> LDistance('dego', 'dglo')
2
# Proposed
>>> newLDistance('doge', 'gold')
1
'doge' and 'gold' have 3/4 matching letters and so should return a distance of 1. Here is my current recursive code:
def mLD(s, t):
memo = {}
def ld(s, t):
if not s: return len(t)
if not t: return len(s)
if s[0] == t[0]: return ld(s[1:], t[1:])
if (s, t) not in memo:
l1 = ld(s, t[1:])
l2 = ld(s[1:], t)
l3 = ld(s[1:], t[1:])
memo[(s,t)] = 1 + min(l1, l2, l3)
return memo[(s,t)]
return ld(s, t)
EDIT: Followup question: Adding exceptions to Levenshtein-Distance-like algorithm
回答1:
You don't need the Levenshtein machinery for this.
import collections
def distance(s1, s2):
cnt = collections.Counter()
for c in s1:
cnt[c] += 1
for c in s2:
cnt[c] -= 1
return sum(abs(diff) for diff in cnt.values()) // 2 + \
(abs(sum(cnt.values())) + 1) // 2 # can be omitted if len(s1) == len(s2)
回答2:
Why not just count how many letters are in common, and find and answer from this? For each character calculate its frequency, then for each string calculate how many "extra" characters it has based on frequencies, and take maximum of these "extra".
Pseudocode:
for c in s1:
cnt1[c]++
for c in s2:
cnt2[c]++
extra1 = 0
extra2 = 0
for c in all_chars:
if cnt1[c]>cnt2[c]
extra1 += cnt1[c]-cnt2[c]
else
extra2 += cnt2[c]-cnt1[c]
return max(extra1, extra2)
来源:https://stackoverflow.com/questions/32456943/modify-levenshtein-distance-to-ignore-order