Difference in normalization of Levenshtein (edit) distance?

Submitted by 随声附和 on 2021-01-27 05:37:15

Question


If the Levenshtein distance between two strings, s and t, is given by L(s,t),

what is the difference in the impact on the resulting heuristic of the following three normalization schemes?

  1. L(s,t) / [length(s) + length(t)]

  2. L(s,t) / max[length(s), length(t)]

  3. (L(s,t)*2) / [length(s) + length(t)]

I noticed that normalization approach 2 is recommended by the Levenshtein distance Wikipedia page but no mention is made of approach 1. Are both approaches equally valid? Just wondering if there is some mathematical justification for using one over the other.

Also, what is the difference between approach 1 and approach 3?

With the following example:

s = "Hi, my name is"

t = "Hello, my name is"

L(s,t) = 4

length(s) = 14 (includes white space)

length(t) = 17 (includes white space)

The normalized Levenshtein distances under the three schemes above are:

  1. 4/(14+17) = 0.129

  2. 4/(17) = 0.235

  3. (4*2)/(14+17) = 0.258
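
The figures above can be reproduced with a short script. This is only a minimal sketch: the function names `levenshtein` and `normalized` are made up for illustration, and the distance is the textbook dynamic-programming version (unit cost for insert, delete, and substitute), not any particular library's implementation.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    # prev[j] is the distance between the processed prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized(s: str, t: str) -> tuple:
    """Return the three normalizations from the question, in order."""
    d = levenshtein(s, t)
    total = len(s) + len(t)
    if total == 0:
        # Assumption: two empty strings are treated as distance 0
        return 0.0, 0.0, 0.0
    return d / total, d / max(len(s), len(t)), 2 * d / total


s = "Hi, my name is"
t = "Hello, my name is"
n1, n2, n3 = normalized(s, t)
print(levenshtein(s, t))                         # 4
print(round(n1, 3), round(n2, 3), round(n3, 3))  # 0.129 0.235 0.258
```

Approach 3 is always exactly twice approach 1, so the two rank any set of string pairs identically and differ only in scale.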


Answer 1:


The effects of variants 1 and 2 should be nearly the same. Variant 2 covers a range from zero (the strings are equal) to one (the strings are completely different), because L(s,t) can never exceed the length of the longer string. The upper bound of variant 1, by contrast, depends on the lengths of the strings: it is 0.5 when the lengths are equal and grows toward 1 as the difference between the lengths increases.
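
To make the bounds concrete, here is a rough sketch that evaluates variants 1 and 2 on pairs of strings with no characters in common, so each pair is "completely different". The helper names `scheme1` and `scheme2` are hypothetical, and the distance function is the standard unit-cost dynamic-programming version.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance via a rolling dynamic-programming row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution or match
        prev = curr
    return prev[-1]


def scheme1(s: str, t: str) -> float:
    # Variant 1: divide by the sum of the lengths
    return levenshtein(s, t) / (len(s) + len(t))


def scheme2(s: str, t: str) -> float:
    # Variant 2: divide by the length of the longer string
    return levenshtein(s, t) / max(len(s), len(t))


# Pairs sharing no characters, with growing length difference
pairs = [("aaaa", "bbbb"), ("a", "bbbbbbb"), ("", "bbbb")]
for s, t in pairs:
    print(repr(s), repr(t), scheme1(s, t), scheme2(s, t))
# 'aaaa' 'bbbb' 0.5 1.0
# 'a' 'bbbbbbb' 0.875 1.0
# '' 'bbbb' 1.0 1.0
```

Variant 2 reports 1.0 for every maximally different pair, whereas variant 1 reports anything from 0.5 to 1.0 depending on how unequal the lengths are. Since variant 3 is twice variant 1, its maximum runs from 1 to 2 over the same cases.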



Source: https://stackoverflow.com/questions/41066394/difference-in-normalization-of-levenshtein-edit-distance
