Modify Levenshtein-Distance to ignore order

岁酱吖の 提交于 2019-12-10 15:08:12

问题


I'm looking to compute the the Levenshtein-distance between sequences containing up to 6 values. The order of these values should not affect the distance.

How would I implement this into the iterative or recursive algorithm?

Example:

# Currently 
>>> LDistance('dog', 'god')
2

# Sorted
>>> LDistance('dgo', 'dgo')
0

# Proposed
>>> newLDistance('dog', 'god')
0

'dog' and 'god' have the exact same letters, sorting the strings before hand will return the desired result. However this doesn't work all the time:

# Currently 
>>> LDistance('doge', 'gold')
3

# Sorted
>>> LDistance('dego', 'dglo')
2

# Proposed
>>> newLDistance('doge', 'gold')
1

'doge' and 'gold' have 3/4 matching letters and so should return a distance of 1. Here is my current recursive code:

def mLD(s, t):
    memo = {}
    def ld(s, t):
        if not s: return len(t)
        if not t: return len(s)
        if s[0] == t[0]: return ld(s[1:], t[1:])
        if (s, t) not in memo:
            l1 = ld(s, t[1:])
            l2 = ld(s[1:], t)
            l3 = ld(s[1:], t[1:])
            memo[(s,t)] = 1 + min(l1, l2, l3)
        return memo[(s,t)]
    return ld(s, t)

EDIT: Followup question: Adding exceptions to Levenshtein-Distance-like algorithm


回答1:


You don't need the Levenshtein machinery for this.

import collections
def distance(s1, s2):
    cnt = collections.Counter()
    for c in s1:
        cnt[c] += 1
    for c in s2:
        cnt[c] -= 1
    return sum(abs(diff) for diff in cnt.values()) // 2 + \
        (abs(sum(cnt.values())) + 1) // 2   # can be omitted if len(s1) == len(s2)



回答2:


Why not just count how many letters are in common, and find and answer from this? For each character calculate its frequency, then for each string calculate how many "extra" characters it has based on frequencies, and take maximum of these "extra".

Pseudocode:

for c in s1:
    cnt1[c]++
for c in s2:
    cnt2[c]++
extra1 = 0
extra2 = 0
for c in all_chars:
    if cnt1[c]>cnt2[c]
        extra1 += cnt1[c]-cnt2[c]
    else
        extra2 += cnt2[c]-cnt1[c]
return max(extra1, extra2)


来源:https://stackoverflow.com/questions/32456943/modify-levenshtein-distance-to-ignore-order

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!