I know there are similar answers to this on Stack Overflow, as well as online, but I feel I'm missing something. Given the code below, we need to reconstruct the sequence of events t
It's my opinion that understanding the algorithm more deeply is important in this case. Rather than giving you some pseudocode, I'll walk you through the essential steps of the algorithm, and show you how the data you want is "encoded" in the final matrix that results. Of course, if you don't need to roll your own algorithm, then you should obviously just use someone else's, as MattH suggests!
This looks to me like an implementation of the Wagner-Fischer algorithm. The basic idea is to calculate the distances between "nearby" prefixes, take the minimum, and then calculate the distance for the current pair of strings from that. So for example, say you have two strings, 'i' and 'h'. Let's lay them out along the vertical and horizontal axes of a matrix, like so:
  _ h
_ 0 1
i 1 1
Here, '_' denotes an empty string, and each cell in the matrix corresponds to an edit sequence that takes an input ('' or 'i') to an output ('' or 'h').
The distance from the empty string to any string of length L is L (requiring L insertions). The distance from any string of length L to the empty string is also L (requiring L deletions). That covers the values in the first row and column, which simply increment.
From there, you can calculate the value of any location by taking the minimum from among the upper, left, and upper-left values, and adding one, or, if the letter is the same at that point in the string, taking the upper-left value unchanged. For the value at (1, 1) in the table above, the minimum is 0 at (0, 0), so the value at (1, 1) is 1, and that's the minimum edit distance from 'i' to 'h' (one substitution). So in general, the minimum edit distance is always in the lower right corner of the matrix.
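The fill rule described above can be sketched in a few lines of Python. This is only an illustration under my own naming (build_matrix, source and target are hypothetical names, not taken from the code in the question), and it assumes every edit costs 1:

```python
def build_matrix(source, target):
    """Fill the Wagner-Fischer table for turning `source` into `target`."""
    rows, cols = len(source) + 1, len(target) + 1
    matrix = [[0] * cols for _ in range(rows)]
    # First column: delete every character of a source prefix.
    for r in range(rows):
        matrix[r][0] = r
    # First row: insert every character of a target prefix.
    for c in range(cols):
        matrix[0][c] = c
    for r in range(1, rows):
        for c in range(1, cols):
            if source[r - 1] == target[c - 1]:
                # Same letter: take the upper-left value unchanged.
                matrix[r][c] = matrix[r - 1][c - 1]
            else:
                # Different letters: cheapest neighbour plus one edit.
                matrix[r][c] = 1 + min(matrix[r - 1][c],      # upper
                                       matrix[r][c - 1],      # left
                                       matrix[r - 1][c - 1])  # upper-left
    return matrix


matrix = build_matrix("i", "h")
print(matrix)          # [[0, 1], [1, 1]]
print(matrix[-1][-1])  # 1, the value in the lower right corner
```

The first index is the row (a prefix of the input string) and the second is the column (a prefix of the output string), matching the (row, column) layout of the table.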
Now let's do another, comparing 'is' to 'hi'. Here again, each cell in the matrix corresponds to an edit sequence that takes an input ('', 'i', or 'is') to an output ('', 'h', or 'hi').
  _ h i
_ 0 1 2
i 1 1 #
s 2 # #
We begin by enlarging the matrix, using '#' as a placeholder for values we don't know yet, and extending the first row and column by incrementing. Having done so, we can begin calculating results for positions marked '#' above. Let's start at (2, 1) (in (row, column), i.e. row-major notation). Among the upper, upper-left, and left values, the minimum is 1. The corresponding letters in the table are different -- 's' and 'h' -- so we add one to that minimum value to get 2, and carry on.
  _ h i
_ 0 1 2
i 1 1 #
s 2 2 #
Let's move on to the value at (1, 2). Now things go a little differently because the corresponding letters in the table are the same -- they're both 'i'. This means we have the option of taking the value in the upper-left cell without adding one. The guiding intuition here is that we don't have to increase the count because the same letter is being added to both strings at this position. And since the lengths of both strings have increased by one, we move diagonally.
  _ h i
_ 0 1 2
i 1 1 1
s 2 2 #
With the last empty cell, things go back to normal. The corresponding letters are 's' and 'i', and so we again take the minimum value and add one, to get 2:
  _ h i
_ 0 1 2
i 1 1 1
s 2 2 2
Here's the table we get if we continue this process for two longer words that start with 'is' and 'hi' -- 'isnt' (ignoring punctuation) and 'hint':
  _ h i n t
_ 0 1 2 3 4
i 1 1 1 2 3
s 2 2 2 2 3
n 3 3 3 2 3
t 4 4 4 3 2
This matrix is slightly more complex, but the final minimum edit distance here is still just 2, because the last two letters of these two strings are the same. Convenient!
So how can we extract the types of edits from this table? The key is to realize that movement on the table corresponds to particular types of edits. So for example, a rightward movement from (0, 0) to (0, 1) takes us from '_ -> _', requiring no edits, to '_ -> h', requiring one edit, an insertion. Likewise, a downward movement from (0, 0) to (1, 0) takes us from '_ -> _', requiring no edits, to 'i -> _', requiring one edit, a deletion. And finally, a diagonal movement from (0, 0) to (1, 1) takes us from '_ -> _', requiring no edits, to 'i -> h', requiring one edit, a substitution.
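To make the mapping from movements to edits concrete, here is a small sketch. The function name classify_move and the tuple labels are my own invention for illustration; a diagonal step between identical letters is labelled Equal, which is the same-letter case from the 'is' to 'hi' walkthrough:

```python
def classify_move(start, end, source, target):
    """Name the edit for one forward step between adjacent (row, column) cells."""
    (r0, c0), (r1, c1) = start, end
    step = (r1 - r0, c1 - c0)
    if step == (0, 1):
        # Rightward: the output grows by one letter.
        return ("Insert", target[c1 - 1])
    if step == (1, 0):
        # Downward: one more input letter is consumed and dropped.
        return ("Delete", source[r1 - 1])
    if step == (1, 1):
        # Diagonal: one input letter is paired with one output letter.
        old, new = source[r1 - 1], target[c1 - 1]
        return ("Equal", old, new) if old == new else ("Substitute", old, new)
    raise ValueError("cells are not adjacent in a forward direction")


print(classify_move((0, 0), (0, 1), "i", "h"))  # ('Insert', 'h')
print(classify_move((0, 0), (1, 0), "i", "h"))  # ('Delete', 'i')
print(classify_move((0, 0), (1, 1), "i", "h"))  # ('Substitute', 'i', 'h')
```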
So now all we have to do is reverse our steps, tracing local minima from among the upper, left, and upper-left cells back to the origin, (0, 0), keeping in mind that if the current value is the same as the minimum, then we must go to the upper-left cell, since that's the only kind of movement that doesn't increment the edit distance.
Here is a detailed description of the steps you could take to do so. Starting from the lower-right corner of the completed matrix, repeat the following until you reach the upper-left corner:
When the move is diagonal and the value does not change, record an empty operation (i.e. Equal). No edit was required in this case because the characters at this location are the same.
In the example above, there are two possible paths:
(4, 4) -> (3, 3) -> (2, 2) -> (1, 2) -> (0, 1) -> (0, 0)
and
(4, 4) -> (3, 3) -> (2, 2) -> (1, 1) -> (0, 0)
Reversing them, we get
(0, 0) -> (0, 1) -> (1, 2) -> (2, 2) -> (3, 3) -> (4, 4)
and
(0, 0) -> (1, 1) -> (2, 2) -> (3, 3) -> (4, 4)
So for the first version, our first operation is a movement to the right, i.e. an insertion. The letter inserted is 'h', since we're moving from 'isnt' to 'hint'. (This corresponds to "Insert, h" in your verbose output.) Our next operation is a diagonal movement, i.e. either a substitution, or a no-op. In this case, it's a no-op because the edit distance is the same at both locations (i.e. the letter is the same). So "Equal, i, i". Then a downward movement, corresponding to a deletion. The letter deleted is 's', since again, we're moving from 'isnt' to 'hint'. (In general, the letter to insert comes from the output string, while the letter to delete comes from the input string.) So that's "Delete, s". Then two diagonal movements with no change in value: "Equal, n, n" and "Equal, t, t".
The result:
Insert, h
Equal, i, i
Delete, s
Equal, n, n
Equal, t, t
Performing these instructions on 'isnt':
isnt (No change)
hisnt (Insertion)
hisnt (No change)
hint (Deletion)
hint (No change)
hint (No change)
For a total edit distance of 2.
I'll leave the second minimum path as an exercise. Keep in mind that both paths are completely equivalent; they may be different, but they will result in the same minimum edit distance of 2, and so are entirely interchangeable. At any point as you work backwards through the matrix, if you see two different possible local minima, you may take either one, and the final result is guaranteed to be correct.
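For reference, here is one way the backward walk might look in Python. Treat it as a sketch rather than the definitive version: the names build_matrix and backtrace are hypothetical, all edits are assumed to cost 1, and the "Substitute" label is my guess at how a substitution would appear in the verbose output. When two moves tie, this version tries a deletion before an insertion and a substitution last, which happens to reproduce the first path above.

```python
def build_matrix(source, target):
    """Wagner-Fischer table with unit costs (rows: source, columns: target)."""
    rows, cols = len(source) + 1, len(target) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        matrix[r][0] = r
    for c in range(cols):
        matrix[0][c] = c
    for r in range(1, rows):
        for c in range(1, cols):
            if source[r - 1] == target[c - 1]:
                matrix[r][c] = matrix[r - 1][c - 1]
            else:
                matrix[r][c] = 1 + min(matrix[r - 1][c], matrix[r][c - 1],
                                       matrix[r - 1][c - 1])
    return matrix


def backtrace(source, target):
    """Walk from the lower-right corner back to (0, 0), collecting edits."""
    matrix = build_matrix(source, target)
    r, c = len(source), len(target)
    edits = []
    while r > 0 or c > 0:
        if r > 0 and c > 0 and source[r - 1] == target[c - 1]:
            # Diagonal move with no change in value: a no-op.
            edits.append(f"Equal, {source[r - 1]}, {target[c - 1]}")
            r, c = r - 1, c - 1
        elif r > 0 and matrix[r][c] == matrix[r - 1][c] + 1:
            # Upward move: the letter comes from the input string.
            edits.append(f"Delete, {source[r - 1]}")
            r -= 1
        elif c > 0 and matrix[r][c] == matrix[r][c - 1] + 1:
            # Leftward move: the letter comes from the output string.
            edits.append(f"Insert, {target[c - 1]}")
            c -= 1
        else:
            # Only the diagonal is left: a substitution.
            edits.append(f"Substitute, {source[r - 1]}, {target[c - 1]}")
            r, c = r - 1, c - 1
    edits.reverse()  # the walk ran backwards, so flip it
    return edits


for line in backtrace("isnt", "hint"):
    print(line)
# Insert, h
# Equal, i, i
# Delete, s
# Equal, n, n
# Equal, t, t
```

Changing the order of the elif branches picks a different, equally short path whenever there is a tie.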
Once you grok all this, it shouldn't be hard to code at all. The key, in cases like this, is to deeply understand the algorithm first. Once you've done that, coding it up is a cinch.
As a final note, you might choose to accumulate the edits as you populate the matrix. In that case, each cell in your matrix could be a tuple: (2, ('ins', 'eq', 'del', 'eq', 'eq')). You would increment the length, and append the operation corresponding to a movement from the minimal previous state. That does away with the backtracking, and so decreases the complexity of the code; but it takes up extra memory. If you do this, the final edit sequence will appear along with the final edit distance in the lower right corner of the matrix.
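A rough sketch of that variant, assuming unit costs, might look like this. The function name edit_distance_with_ops and the 'sub' label are my own; the other labels follow the tuple shown above. Ties between neighbours are broken in the order deletion, insertion, substitution, which is an arbitrary choice:

```python
def edit_distance_with_ops(source, target):
    """Each cell holds (distance, operations so far); no backtracking needed."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, ())
    # First column: a run of deletions.
    for r in range(1, rows):
        cost, ops = table[r - 1][0]
        table[r][0] = (cost + 1, ops + ("del",))
    # First row: a run of insertions.
    for c in range(1, cols):
        cost, ops = table[0][c - 1]
        table[0][c] = (cost + 1, ops + ("ins",))
    for r in range(1, rows):
        for c in range(1, cols):
            if source[r - 1] == target[c - 1]:
                # Same letter: extend the upper-left cell at no cost.
                cost, ops = table[r - 1][c - 1]
                table[r][c] = (cost, ops + ("eq",))
            else:
                candidates = [
                    (table[r - 1][c], "del"),      # upper
                    (table[r][c - 1], "ins"),      # left
                    (table[r - 1][c - 1], "sub"),  # upper-left
                ]
                # min() keeps the first candidate among equal costs.
                (cost, ops), op = min(candidates, key=lambda item: item[0][0])
                table[r][c] = (cost + 1, ops + (op,))
    return table[-1][-1]


print(edit_distance_with_ops("isnt", "hint"))
# (2, ('ins', 'eq', 'del', 'eq', 'eq'))
```

The extra memory shows up in the tuples: every cell carries its own copy of the operations that led to it.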
I suggest you have a look at the python-Levenshtein module. It will probably get you a long way there:
>>> import Levenshtein
>>> Levenshtein.editops('LEAD','LAST')
[('replace', 1, 1), ('replace', 2, 2), ('replace', 3, 3)]
You can process the output from editops to create your verbose instructions.
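As a sketch of that processing step, the helper below turns a list of editops triples into verbose lines. It assumes each triple is (operation, source index, destination index), which is how I read the output above, and that unchanged letters are not listed. The name verbose_ops and the "Replace, ..." wording are hypothetical. To keep the example self-contained, it feeds in the list printed above rather than importing the module:

```python
def verbose_ops(editops, source, target):
    """Translate (operation, source index, destination index) triples to text."""
    lines = []
    for tag, spos, dpos in editops:
        if tag == "replace":
            lines.append(f"Replace, {source[spos]}, {target[dpos]}")
        elif tag == "insert":
            # The inserted letter comes from the destination string.
            lines.append(f"Insert, {target[dpos]}")
        elif tag == "delete":
            # The deleted letter comes from the source string.
            lines.append(f"Delete, {source[spos]}")
        else:
            raise ValueError(f"unexpected operation: {tag}")
    return lines


ops = [('replace', 1, 1), ('replace', 2, 2), ('replace', 3, 3)]
print(verbose_ops(ops, 'LEAD', 'LAST'))
# ['Replace, E, A', 'Replace, A, S', 'Replace, D, T']
```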
I don't know Python, but the following C# code works, if that's any help.
using System;
using System.Collections.Generic;
public class EditDistanceCalculator
{
    public double SubstitutionCost { get; private set; }
    public double DeletionCost { get; private set; }
    public double InsertionCost { get; private set; }
    // Default: every edit costs 1.
    public EditDistanceCalculator() : this(1, 1, 1)
    {
    }
    public EditDistanceCalculator(double substitutionCost, double insertionCost, double deletionCost)
    {
        InsertionCost = insertionCost;
        DeletionCost = deletionCost;
        SubstitutionCost = substitutionCost;
    }
    // Returns the moves that turn s into t at minimum total cost.
    public Move[] CalcEditDistance(string s, string t)
    {
        if (s == null) throw new ArgumentNullException("s");
        if (t == null) throw new ArgumentNullException("t");
        var distances = new Cell[s.Length + 1, t.Length + 1];
        // First column: delete every character of a prefix of s.
        for (int i = 0; i <= s.Length; i++)
            distances[i, 0] = new Cell(i * DeletionCost, Move.Delete);
        // First row: insert every character of a prefix of t.
        for (int j = 0; j <= t.Length; j++)
            distances[0, j] = new Cell(j * InsertionCost, Move.Insert);
        for (int i = 1; i <= s.Length; i++)
            for (int j = 1; j <= t.Length; j++)
                distances[i, j] = CalcEditDistance(distances, s, t, i, j);
        return GetEdit(distances, s.Length, t.Length);
    }
    // Picks the cheapest way to reach cell (i, j) and remembers which move it was.
    private Cell CalcEditDistance(Cell[,] distances, string s, string t, int i, int j)
    {
        // Diagonal: free if the characters match, otherwise a substitution.
        var cell = s[i - 1] == t[j - 1]
            ? new Cell(distances[i - 1, j - 1].Cost, Move.Match)
            : new Cell(SubstitutionCost + distances[i - 1, j - 1].Cost, Move.Substitute);
        double deletionCost = DeletionCost + distances[i - 1, j].Cost;
        if (deletionCost < cell.Cost)
            cell = new Cell(deletionCost, Move.Delete);
        double insertionCost = InsertionCost + distances[i, j - 1].Cost;
        if (insertionCost < cell.Cost)
            cell = new Cell(insertionCost, Move.Insert);
        return cell;
    }
    // Walks back from (i, j) to the origin, following the stored moves.
    private static Move[] GetEdit(Cell[,] distances, int i, int j)
    {
        var moves = new Stack<Move>();
        while (i > 0 && j > 0)
        {
            var move = distances[i, j].Move;
            moves.Push(move);
            switch (move)
            {
                case Move.Match:
                case Move.Substitute:
                    i--;
                    j--;
                    break;
                case Move.Insert:
                    j--;
                    break;
                case Move.Delete:
                    i--;
                    break;
                default:
                    throw new ArgumentOutOfRangeException();
            }
        }
        // One index has reached zero; the rest is pure deletions or insertions.
        for (int k = 0; k < i; k++)
            moves.Push(Move.Delete);
        for (int k = 0; k < j; k++)
            moves.Push(Move.Insert);
        // Stack<T>.ToArray returns items in pop order, so the moves come out first to last.
        return moves.ToArray();
    }
    class Cell
    {
        public double Cost { get; private set; }
        public Move Move { get; private set; }
        public Cell(double cost, Move move)
        {
            Cost = cost;
            Move = move;
        }
    }
}
public enum Move
{
    Match,
    Substitute,
    Insert,
    Delete
}
Some tests:
// Requires: using System; using System.Linq; using Microsoft.VisualStudio.TestTools.UnitTesting;
[TestMethod]
public void TestEditDistance()
{
    var expected = new[]
    {
        Move.Delete,
        Move.Substitute,
        Move.Match,
        Move.Match,
        Move.Match,
        Move.Match,
        Move.Match,
        Move.Insert,
        Move.Substitute,
        Move.Match,
        Move.Substitute,
        Move.Match,
        Move.Match,
        Move.Match,
        Move.Match
    };
    Assert.IsTrue(expected.SequenceEqual(new EditDistanceCalculator().CalcEditDistance("thou-shalt-not", "you-should-not")));
    // Substitution costs 3 here; insertion and deletion cost 1 each.
    var calc = new EditDistanceCalculator(3, 1, 1);
    var edit = calc.CalcEditDistance("democrat", "republican");
    Console.WriteLine(string.Join(",", edit));
    Assert.AreEqual(3, edit.Count(m => m == Move.Match)); // the matched letters are e, c, a
}