问题
Based on this paper: IEEE TRANSACTIONS ON PAITERN ANALYSIS : Computation of Normalized Edit Distance and Applications In this paper Normalized Edit Distance as followed:
Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d( X , Y ) is defined as the minimum of W( P ) / L ( P )w, here P is an editing path between X and Y , W ( P ) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (length of P).
Can i safely translate the normalized edit distance algorithm explained above as this:
normalized edit distance =
levenshtein(query 1, query 2)/max(length(query 1), length(query 2))
回答1:
You are probably misunderstanding the metric. There are two issues:
The normalization step is to divide
W(P)
which is the weight of the edit procedure overL(P)
, which is the length of the edit procedure, not over the max length of the strings as you did;Also, the paper showed that (Example 3.1) normalized edit distance cannot be simply computed with levenshtein distance. You probably need to implement their algorithm.
An explanation of Example 3.1 (c):
From aaab
to abbb
, the paper used the following transformations:
- match
a
witha
; - skip
a
in the first string; - skip
a
in the first string; - skip
b
in the second string; - skip
b
in the second string; - match the final
b
s.
These are 6 operations which is why L(P)
is 6; from the matrix in (a), matching has cost 0, skipping has cost 2, thus we have total cost of 0 + 2 + 2 + 2 + 2 + 0 = 8
, which is exactly W(P)
, and W(P) / L(P) = 1.33
. Similar results can be obtained for (b), which I'll left to you as exercise :-)
回答2:
The 3 in figure 2(a) refers to the cost of changing "a" to "b" or the cost of changing "b" to "a". The columns with lambdas in figure 2(a) mean that it costs 2 in order to insert or delete either an "a" or a "b".
In figure 2(b), W(P) = 6 because the algorithm does the following steps:
- keep first a (cost 0)
- convert first b to a (cost 3)
- convert second b to a (cost 3)
- keep last b (cost 0)
The sum of the costs of the steps is W(P). The number of steps is 4 which is L(P).
In figure 2(c), the steps are different:
- keep first a (cost 0)
- delete first b (cost 2)
- delete second b (cost 2)
- insert a (cost 2)
- insert a (cost 2)
- keep last b (cost 0)
In this path there are six steps so the L(P) is 6. The sum of the costs of the steps is 8 so W(P) is 8. Therefore the normalized edit distance is 8/6 = 4/3 which is about 1.33.
来源:https://stackoverflow.com/questions/30787098/explanation-of-normalized-edit-distance-formula