Done. Below is the code that finally passed all of my tests. Again, this is modeled after Murilo Vasconcelo's modified version of Steve
The function walk takes a testitem (for example an indexable string, or an array of characters) and a trie. A trie can be an object with two slots: one holds the node's value, the other holds the children of that node. The children are tries as well. In Python it would be something like:
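    class Trie(object):
        def __init__(self, node=None, children=None):
            self.node = node
            # A default of [] would be shared by every instance, so build a fresh list.
            self.children = children if children is not None else []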
Or in Lisp...
    (defstruct trie (node nil) (children nil))
Now a trie looks something like this:
    (trie #node None
          #children ((trie #node f
                           #children ((trie #node o
                                            #children ((trie #node o
                                                             #children None)))
                                      (trie #node u
                                            #children ((trie #node n
                                                             #children None)))))))
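For what it's worth, the nested structure above (holding "foo" and "fun") can be built up one word at a time. This is only a rough sketch: `insert` and `paths` are hypothetical helper names that are not part of the original answer, and the `Trie` class is repeated so the snippet runs on its own.

```python
class Trie(object):
    """A node value plus a list of child tries, as in the class above."""

    def __init__(self, node=None, children=None):
        self.node = node
        self.children = children if children is not None else []


def insert(trie, word):
    """Hypothetical helper: add `word` below `trie`, one character per level."""
    for ch in word:
        for child in trie.children:
            if child.node == ch:
                trie = child          # reuse the existing branch
                break
        else:
            child = Trie(ch)          # no branch for this character yet
            trie.children.append(child)
            trie = child


def paths(trie, prefix=""):
    """Hypothetical helper: list every root-to-leaf string in the trie."""
    if not trie.children:
        return [prefix]
    found = []
    for child in trie.children:
        found.extend(paths(child, prefix + child.node))
    return found


root = Trie()
for word in ("foo", "fun"):
    insert(root, word)

print(paths(root))  # ['foo', 'fun']
```

Note that this two-slot layout has no end-of-word marker, so a word that is a prefix of another word (say "fo" next to "foo") cannot be told apart from a plain prefix without adding one.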
Now the internal function (which you can also write separately) takes the testitem, the children of the root node of the trie (whose node value is None or whatever), and an initial distance set to 0.
Then we just recursively traverse both branches of the tree, starting left and then right.
My intuition tells me that each TrieNode should store the String it represents and also references to letters of the alphabet, not necessarily all letters. Is my intuition correct?
No, a trie doesn't represent a String, it represents a set of strings (and all their prefixes). A trie node maps an input character to another trie node. So it should hold something like an array of characters and a corresponding array of TrieNode references. (Maybe not that exact representation, depending on efficiency in your particular use of it.)
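As a rough illustration of "a node maps an input character to another node", here is a minimal Python sketch that uses a dict instead of two parallel arrays. The names `TrieNode`, `add_word`, `find`, and `is_word` are made up for this example; the `is_word` flag is an assumption added to distinguish stored words from mere prefixes.

```python
class TrieNode:
    """One node: maps a character to the child node reached by it."""

    def __init__(self):
        self.children = {}     # char -> TrieNode
        self.is_word = False   # True if the path from the root spells a stored word


def add_word(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def find(root, prefix):
    """Return the node reached by `prefix`, or None if nothing starts with it."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


root = TrieNode()
for w in ("foo", "fun"):
    add_word(root, w)

print(find(root, "fo") is not None)  # True: "fo" is a prefix of a stored word
print(find(root, "fo").is_word)      # False: "fo" itself was never added
print(find(root, "fun").is_word)     # True
```

No node stores a string here; the string is implied by the path taken from the root.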
Well, here's how I did it a long time ago. I stored the dictionary as a trie, which is simply a finite-state-machine restricted to the form of a tree. You can enhance it by not making that restriction. For example, common suffixes can simply be a shared subtree. You could even have loops, to capture stuff like "nation", "national", "nationalize", "nationalization", ...
Keep the trie as absolutely simple as possible. Don't go stuffing strings in it.
Remember, you don't do this to find the distance between two given strings. You use it to find the strings in the dictionary that are closest to one given string. The time it takes depends on how much Levenshtein distance you can tolerate. For distance zero, it is simply O(n), where n is the word length, because you follow a single path. For an arbitrarily large distance nothing can be pruned, so it is O(N), where N is the number of words in the dictionary.
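One common way to do such a bounded search (not necessarily what I did back then) is to carry one row of the usual Levenshtein table down each branch and to stop descending once every entry in the row exceeds the allowed distance. A minimal sketch, assuming a dict-based node; `Node`, `build`, `search`, and `_walk` are names invented for this example:

```python
class Node:
    def __init__(self):
        self.children = {}   # char -> Node
        self.word = None     # full word, set only on nodes that end one


def build(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node())
        node.word = w
    return root


def search(root, target, max_dist):
    """Return (word, distance) pairs with Levenshtein distance <= max_dist."""
    results = []
    first_row = list(range(len(target) + 1))
    for ch, child in root.children.items():
        _walk(child, ch, target, first_row, max_dist, results)
    return results


def _walk(node, ch, target, prev_row, max_dist, results):
    # Extend the dynamic-programming table by one row for character `ch`.
    row = [prev_row[0] + 1]
    for i in range(1, len(target) + 1):
        insert_cost = row[i - 1] + 1
        delete_cost = prev_row[i] + 1
        replace_cost = prev_row[i - 1] + (target[i - 1] != ch)
        row.append(min(insert_cost, delete_cost, replace_cost))
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    # Prune: the row minimum never decreases further down this branch.
    if min(row) <= max_dist:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, target, row, max_dist, results)


trie = build(["foo", "fun", "nation", "national"])
print(sorted(search(trie, "fon", 1)))  # [('foo', 1), ('fun', 1)]
```

With `max_dist` set to 0 only the single matching path survives the pruning test. With a very large `max_dist` every node is visited.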
Here is an example of Levenshtein Automata in Java (EDIT: moved to GitHub). These will probably also be helpful:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/ http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/
EDIT: The above links seem to have moved to GitHub:
https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/util/automaton https://github.com/apache/lucene-solr/tree/master/lucene/core/src/test/org/apache/lucene/util/automaton
It looks like the experimental Lucene code is based off of the dk.brics.automaton package.
Usage appears to be something similar to below:
    LevenshteinAutomata builder = new LevenshteinAutomata(s);
    Automaton automata = builder.toAutomaton(n);
    boolean result1 = BasicOperations.run(automata, "foo");
    boolean result2 = BasicOperations.run(automata, "bar");
If I understand correctly, you want to loop over all branches of the trie. That's not that difficult using a recursive function. I'm using a trie as well in my k-nearest neighbor algorithm, using the same kind of function. I don't know Java, but here's some pseudocode:
    function walk (testitem trie)
        make an empty array results
        function compare (testitem children distance)
            if testitem is empty
                place the distance and children into results
            else
                compare(testitem from second position,
                        the sub-children of the first child in children,
                        if the first item of testitem differs from
                           the node of the first child of children
                           add one to the distance (! non-destructive)
                        else just the distance)
                when there are any children left
                    compare(testitem, the children without the first item,
                            distance)
        compare(testitem, children of root-node in trie, distance set to 0)
        return the results
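In case Python is easier to read than the pseudocode, here is a rough translation. It rests on a few assumptions: a mismatch at a position adds 1 to the distance, branches shorter than the testitem are silently dropped, and the matched path is recorded instead of the remaining children (the `path` argument is an addition made for readability). It counts mismatched positions only, so it is not a full Levenshtein distance with insertions and deletions.

```python
class Trie(object):
    def __init__(self, node=None, children=None):
        self.node = node
        self.children = children if children is not None else []


def walk(testitem, trie):
    results = []

    def compare(rest, children, distance, path):
        if not rest:
            # testitem is used up: record the distance for this branch.
            results.append((distance, path))
            return
        for child in children:  # visit siblings left to right
            # Assumption: a mismatch at this position costs 1.
            step = 0 if child.node == rest[0] else 1
            compare(rest[1:], child.children, distance + step, path + child.node)

    compare(testitem, trie.children, 0, "")
    return results


# The example trie holding "foo" and "fun".
root = Trie(children=[
    Trie("f", [
        Trie("o", [Trie("o")]),
        Trie("u", [Trie("n")]),
    ]),
])

print(walk("fon", root))  # [(1, 'foo'), (1, 'fun')]
```

The loop over `children` replaces the "first child, then the rest" recursion of the pseudocode, and adding `step` to `distance` creates a new value for each call, so nothing is modified in place.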
Hope it helps.
I'll just leave this here in case anyone is looking for yet another treatment of this problem:
http://code.google.com/p/oracleofwoodyallen/wiki/ApproximateStringMatching