Done. Below is the code that finally passed all of my tests. Again, this is modeled after Murilo Vasconcelos's modified version of Steve Hanov's algorithm.
Correct me if I am wrong, but I believe your update 3 has an extra loop which is unnecessary and makes the program much slower:
for (int i = 0; i < iWordLength; i++) {
    traverseTrie(theTrie.root, word.get(i), word, currentRow);
}
You only need to call traverseTrie once, because inside traverseTrie you are already looping over the whole word. The call should simply be:
traverseTrie(theTrie.root, ' ', word, currentRow);
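To make the point concrete, here is roughly the shape I am assuming your traverseTrie has, pieced together from the snippets you posted (the TrieNode class, the minDistance field, and the handling of the ' ' placeholder are my guesses, not your actual code). The inner loop already walks the whole word once per trie node, and the recursion over children covers the whole trie from that single call:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal trie node, only to make the shape of the recursion concrete.
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    boolean isWord;
}

class TraverseTrieSketch {
    int minDistance = Integer.MAX_VALUE;

    // Called once as traverseTrie(root, ' ', word, initialRow) where initialRow = [0, 1, ..., n].
    void traverseTrie(TrieNode node, char letter, ArrayList<Character> word, int[] previousRow) {
        int columns = word.size() + 1;
        int[] currentRow;

        if (letter == ' ') {
            // Root call: no trie character consumed yet, so reuse the initial row unchanged.
            currentRow = previousRow;
        } else {
            currentRow = new int[columns];
            currentRow[0] = previousRow[0] + 1;   // one more deletion than the parent's row

            // This loop already visits every character of the word for this trie node,
            // which is why the extra word loop around the top-level call is redundant.
            for (int i = 1; i < columns; i++) {
                int insertCost = currentRow[i - 1] + 1;
                int deleteCost = previousRow[i] + 1;
                int replaceCost = (word.get(i - 1) == letter) ? previousRow[i - 1]
                                                              : previousRow[i - 1] + 1;
                currentRow[i] = Math.min(insertCost, Math.min(deleteCost, replaceCost));
            }

            if (node.isWord && currentRow[columns - 1] < minDistance) {
                minDistance = currentRow[columns - 1];   // best edit distance found so far
            }
        }

        // Recursing into every child is what covers the whole trie from the single call.
        for (Map.Entry<Character, TrieNode> child : node.children.entrySet()) {
            traverseTrie(child.getValue(), child.getKey(), word, currentRow);
        }
    }
}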
From what I can tell, you don't need to improve the efficiency of the Levenshtein distance itself; you need to store your strings in a structure that avoids running the distance computation so many times, i.e. one that prunes the search space.
Since Levenshtein distance is a metric, you can use any of the metric space indexes that take advantage of the triangle inequality. You mentioned BK-trees, but there are others, e.g. Vantage Point Trees, Fixed-Queries Trees, Bisector Trees, and Spatial Approximation Trees. Here are their descriptions:
Burkhard-Keller Tree
Nodes are inserted into the tree as follows: For the root node pick an arbitrary element from the space; add unique edge-labeled children such that the value of each edge is the distance from the pivot to that element; apply recursively, selecting the child as the pivot when an edge already exists.
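As an illustration, here is a minimal BK-tree sketch in Java; the class and method names and the plain DP levenshtein helper are my own, not taken from any particular library:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal BK-tree: each child edge is labeled with the Levenshtein distance
// between the child's word and this node's word.
class BKTree {
    private final String word;
    private final Map<Integer, BKTree> children = new HashMap<>();

    BKTree(String word) { this.word = word; }

    void insert(String newWord) {
        int d = levenshtein(newWord, word);
        if (d == 0) return;                        // already stored
        BKTree child = children.get(d);
        if (child == null) {
            children.put(d, new BKTree(newWord));  // new edge labeled d
        } else {
            child.insert(newWord);                 // recurse with the child as the new pivot
        }
    }

    // Range query: the triangle inequality lets us skip any child whose edge label
    // differs from the query's distance to this node by more than maxDist.
    void search(String query, int maxDist, List<String> out) {
        int d = levenshtein(query, word);
        if (d <= maxDist) out.add(word);
        for (Map.Entry<Integer, BKTree> e : children.entrySet()) {
            if (Math.abs(e.getKey() - d) <= maxDist) {
                e.getValue().search(query, maxDist, out);
            }
        }
    }

    // Plain two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        BKTree tree = new BKTree("area");
        tree.insert("arb");
        List<String> matches = new ArrayList<>();
        tree.search("arc", 1, matches);
        System.out.println(matches);   // prints [arb]: the only word within distance 1 of "arc"
    }
}

Insertion costs one distance computation per level, and a range query only descends into children whose edge labels fall within maxDist of the query's distance to the current node.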
Fixed-Queries Tree
As with BKTs except: Elements are stored at leaves; Each leaf has multiple elements; For each level of the tree the same pivot is used.
Bisector Tree
Each node contains two pivot elements with their covering radius (maximum distance between the centre element and any of its subtree elements); Filter into two sets those elements which are closest to the first pivot and those closest to the second, and recursively build two subtrees from these sets.
Spatial Approximation Tree
Initially all elements are in a bag; Choose an arbitrary element to be the pivot; Build a collection of nearest neighbours within range of the pivot; Put each remaining element into the bag of the nearest element to it from collection just built; Recursively form a subtree from each element of this collection.
Vantage Point Tree
Choose a pivot from the set arbitrarily; Calculate the median distance between this pivot and each element of the remaining set; Filter elements from the set into left and right recursive subtrees such that those with distances less than or equal to the median form the left and those greater form the right.
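To make the last one concrete as well, here is a small vantage point tree sketch in Java (again my own naming; the metric is passed in, e.g. a Levenshtein function like the one in the BK-tree sketch above, and the whole word list is assumed to fit in memory):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.ToIntBiFunction;

// Minimal vantage point tree over strings; the word list must be non-empty.
class VPTree {
    private final String pivot;
    private final int medianDistance;
    private final VPTree left;    // elements at distance <= medianDistance from the pivot
    private final VPTree right;   // elements at distance >  medianDistance from the pivot
    private final ToIntBiFunction<String, String> metric;

    VPTree(List<String> words, ToIntBiFunction<String, String> metric) {
        this.metric = metric;
        this.pivot = words.get(0);                       // "arbitrary" pivot: take the first word
        List<String> rest = words.subList(1, words.size());
        if (rest.isEmpty()) {
            this.medianDistance = 0;
            this.left = null;
            this.right = null;
            return;
        }
        List<Integer> dists = new ArrayList<>();
        for (String w : rest) dists.add(metric.applyAsInt(pivot, w));
        List<Integer> sorted = new ArrayList<>(dists);
        Collections.sort(sorted);
        this.medianDistance = sorted.get(sorted.size() / 2);

        List<String> leftWords = new ArrayList<>();
        List<String> rightWords = new ArrayList<>();
        for (int i = 0; i < rest.size(); i++) {
            if (dists.get(i) <= medianDistance) leftWords.add(rest.get(i));
            else rightWords.add(rest.get(i));
        }
        this.left = leftWords.isEmpty() ? null : new VPTree(leftWords, metric);
        this.right = rightWords.isEmpty() ? null : new VPTree(rightWords, metric);
    }

    // Range query: collect every stored word within maxDist of query. The triangle
    // inequality bounds which side(s) of the median can still contain matches.
    void search(String query, int maxDist, List<String> out) {
        int d = metric.applyAsInt(query, pivot);
        if (d <= maxDist) out.add(pivot);
        if (left != null && d - maxDist <= medianDistance) left.search(query, maxDist, out);
        if (right != null && d + maxDist >= medianDistance) right.search(query, maxDist, out);
    }
}

Building it costs one distance computation per word per level, and a query whose radius is small relative to the spread of distances usually descends into only one side at each node.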
In many ways, Steve Hanov's algorithm (presented in the first article linked in the question, Fast and Easy Levenshtein distance using a Trie), the ports of the algorithm made by Murilo and you (OP), and quite possibly every pertinent algorithm involving a Trie or similar structure, function much like a Levenshtein Automaton (which has been mentioned several times here) does:
Given:
    dict is a dictionary represented as a DFA (e.g. a trie or DAWG)
    dictState is a state in dict
    dictStartState is the start state in dict
    dictAcceptState is a dictState arrived at after following the transitions defined by a word in dict
    editDistance is an edit distance
    laWord is a word
    la is a Levenshtein Automaton defined for laWord and editDistance
    laState is a state in la
    laStartState is the start state in la
    laAcceptState is a laState arrived at after following the transitions defined by a word that is within editDistance of laWord
    charSequence is a sequence of chars
    resultSet is the set of words collected by the traversal (those within editDistance of laWord)
    traversalDataStack is a stack of (dictState, laState, charSequence) tuples
Define dictState as dictStartState
Define laState as laStartState
Push (dictState, laState, "") on to traversalDataStack
While traversalDataStack is not empty
    Define currentTraversalDataTuple as the product of a pop of traversalDataStack
    Define currentDictState as the dictState in currentTraversalDataTuple
    Define currentLAState as the laState in currentTraversalDataTuple
    Define currentCharSequence as the charSequence in currentTraversalDataTuple
    For each char in alphabet
        Check if currentDictState has an outgoing transition labeled by char
        Check if currentLAState has an outgoing transition labeled by char
        If both currentDictState and currentLAState have outgoing transitions labeled by char
            Define newDictState as the state arrived at after following the outgoing transition of currentDictState labeled by char
            Define newLAState as the state arrived at after following the outgoing transition of currentLAState labeled by char
            Define newCharSequence as the concatenation of currentCharSequence and char
            Push (newDictState, newLAState, newCharSequence) on to traversalDataStack
            If newDictState is a dictAcceptState, and newLAState is a laAcceptState
                Add newCharSequence to resultSet
            endIf
        endIf
    endFor
endWhile
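In Java, that traversal might look something like the sketch below; the DfaState interface is hypothetical glue I am assuming for both the dictionary DFA and the Levenshtein automaton, not the API of any particular library:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical minimal state interface: both the dictionary DFA (trie/DAWG) and the
// Levenshtein automaton are assumed to expose per-character transitions and an accept flag.
interface DfaState {
    DfaState transition(char c);   // null when there is no outgoing edge labeled c
    boolean isAccept();
}

class IntersectionTraversal {

    private static final class Frame {
        final DfaState dictState;
        final DfaState laState;
        final String charSequence;
        Frame(DfaState d, DfaState l, String s) { dictState = d; laState = l; charSequence = s; }
    }

    // Collects every word accepted by both automata, i.e. every dictionary word
    // within the automaton's edit distance of its query word.
    static List<String> intersect(DfaState dictStart, DfaState laStart, char[] alphabet) {
        List<String> resultSet = new ArrayList<>();
        Deque<Frame> traversalDataStack = new ArrayDeque<>();
        traversalDataStack.push(new Frame(dictStart, laStart, ""));

        while (!traversalDataStack.isEmpty()) {
            Frame current = traversalDataStack.pop();
            for (char c : alphabet) {
                DfaState nextDict = current.dictState.transition(c);
                DfaState nextLA = current.laState.transition(c);
                if (nextDict != null && nextLA != null) {          // both automata allow c here
                    String nextWord = current.charSequence + c;
                    traversalDataStack.push(new Frame(nextDict, nextLA, nextWord));
                    if (nextDict.isAccept() && nextLA.isAccept()) {
                        resultSet.add(nextWord);                   // within editDistance of laWord
                    }
                }
            }
        }
        return resultSet;
    }
}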
Steve Hanov's algorithm and its aforementioned derivatives obviously use a Levenshtein distance computation matrix in place of a formal Levenshtein Automaton. That is pretty fast, but a formal Levenshtein Automaton can have its parametric states (abstract states which describe the concrete states of the automaton) generated and used for traversal, bypassing any edit-distance-related runtime computation whatsoever, so it should run even faster than the aforementioned algorithms.
If you (or anybody else) is interested in a formal Levenshtein Automaton solution, have a look at LevenshteinAutomaton. It implements the aforementioned parametric-state-based algorithm, as well as a pure concrete-state-traversal-based algorithm (outlined above) and dynamic-programming-based algorithms (for both edit distance and neighbor determination). It's maintained by yours truly :) .
I've implemented the algorithm described in the "Fast and Easy Levenshtein distance using a Trie" article in C++, and it is really fast. If you want it (and understand C++ better than Python), I can paste the code somewhere.
Edit: I posted it on my blog.
I was looking at your latest update 3, and the algorithm does not seem to work correctly for me.
Say you have the following test case:
Trie dict = new Trie();
dict.insert("arb");
dict.insert("area");
ArrayList<Character> word = new ArrayList<Character>();
word.add('a');
word.add('r');
word.add('c');
In this case, the minimum edit distance between "arc" and the dict should be 1, which is the edit distance between "arc" and "arb", but your algorithm returns 2 instead.
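Just to double-check the expected number, here is a standalone DP edit distance helper (my own throwaway code, not part of your solution), which confirms that the answer should be 1:

class DistanceCheck {
    // Standalone dynamic-programming Levenshtein distance, used only to verify the expected result.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(editDistance("arc", "arb"));    // 1: substitute 'c' with 'b'
        System.out.println(editDistance("arc", "area"));   // 2: substitute 'c' with 'e', insert 'a'
    }
}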
I went through the following piece of code:
if (word.get(i - 1) == letter) {
    replaceCost = previousRow[i - 1];
} else {
    replaceCost = previousRow[i - 1] + 1;
}
At least for the first loop, letter is one of the characters in the word, but it should instead be the character stored in the trie node being compared, so one row ends up duplicated against the first character of the word. Is that right? Each DP matrix has its first row as a duplicate. I executed the exact same code you put in your solution.