Implementing a simple Trie for efficient Levenshtein Distance calculation - Java


UPDATE 3

Done. Below is the code that finally passed all of my tests. Again, this is modeled after Murilo Vasconcelo's modified version of Steve

11 Answers

不思量自难忘°

2020-12-22 19:20
The function walk takes a testitem (for example an indexable string, or an array of characters) and a trie. A trie can be an object with two slots: one holding the node of the trie, the other holding the children of that node. The children are tries as well. In Python it would be something like:
```
class Trie(object):
    def __init__(self, node=None, children=None):
        self.node = node
        # Avoid a mutable default argument: each trie gets its own list.
        self.children = children if children is not None else []
```
Or in Lisp...
```
(defstruct trie (node nil) (children nil))
```
Now a trie looks something like this:
```
(trie #node None
      #children ((trie #node f
                       #children ((trie #node o
                                        #children ((trie #node o
                                                         #children None)))
                                  (trie #node u
                                        #children ((trie #node n
                                                         #children None)))))))
```
Now the internal function (which you can also write separately) takes the testitem, the children of the root node of the trie (whose node value is None or whatever), and an initial distance set to 0.

Then we just recursively traverse every branch of the trie, going from left to right.
挽巷

2020-12-22 19:21

My intuition tells me that each TrieNode should store the String it represents and also references to letters of the alphabet, not necessarily all letters. Is my intuition correct?

No, a trie doesn't represent a String, it represents a set of strings (and all their prefixes). A trie node maps an input character to another trie node. So it should hold something like an array of characters and a corresponding array of TrieNode references. (Maybe not that exact representation, depending on efficiency in your particular use of it.)
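As a rough illustration of that point, here is a minimal sketch in which each node only maps a character to a child node and stores no string. The names `SimpleTrie`, `Node`, `endOfWord`, `insert`, `contains` and `hasPrefix` are made up for this example, and a `HashMap` is just one possible representation of the character-to-node mapping.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; one possible layout, not the only one.
class SimpleTrie {
    // A node maps an input character to the next node. It stores no string.
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord; // true if the path from the root spells a stored word
    }

    private final Node root = new Node();

    // Adds a word by creating any missing nodes along its path.
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    // True only if the exact word was inserted.
    boolean contains(String word) {
        Node node = find(word);
        return node != null && node.endOfWord;
    }

    // True if some stored word starts with the given prefix.
    boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    // Follows the path for s and returns the node it ends on, or null.
    private Node find(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }
}
```

After inserting "foo" and "fun", `contains("fo")` is false while `hasPrefix("fo")` is true, which is the "set of strings and all their prefixes" behaviour described above.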

难免孤独

2020-12-22 19:27

Well, here's how I did it a long time ago. I stored the dictionary as a trie, which is simply a finite-state machine restricted to the form of a tree. You can enhance it by dropping that restriction. For example, common suffixes can simply be a shared subtree. You could even have loops, to capture stuff like "nation", "national", "nationalize", "nationalization", ...

Keep the trie as absolutely simple as possible. Don't go stuffing strings in it.

Remember, you don't do this to find the distance between two given strings. You use it to find the strings in the dictionary that are closest to one given string. The time it takes depends on how much Levenshtein distance you can tolerate. For distance zero, it is simply O(n), where n is the word length. For arbitrary distance, it is O(N), where N is the number of words in the dictionary.
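A hedged sketch of how such a bounded search could look in Java follows. It uses the usual row-per-node dynamic programming: each trie node gets one row of the Levenshtein table, and a subtree is skipped once every entry in its row exceeds the tolerated distance. This is not necessarily the implementation described above. The names `FuzzyTrie`, `Node`, `search`, `walk` and `maxDistance` are hypothetical, and words are rebuilt from the path so that no strings are stored in the nodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names; a sketch under the assumptions stated in the lead-in.
class FuzzyTrie {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    // Returns every stored word within maxDistance edits of query.
    List<String> search(String query, int maxDistance) {
        // Row for the empty prefix: turning "" into query[0..i) costs i edits.
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        List<String> results = new ArrayList<>();
        walk(root, firstRow, query, maxDistance, new StringBuilder(), results);
        return results;
    }

    // row[i] is the edit distance between the current prefix and query[0..i).
    private void walk(Node node, int[] row, String query, int maxDistance,
                      StringBuilder prefix, List<String> results) {
        if (node.endOfWord && row[query.length()] <= maxDistance) {
            results.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            char c = entry.getKey();
            int[] next = new int[row.length];
            next[0] = row[0] + 1;
            int best = next[0];
            for (int i = 1; i < next.length; i++) {
                int substitution = row[i - 1] + (query.charAt(i - 1) == c ? 0 : 1);
                next[i] = Math.min(Math.min(next[i - 1] + 1, row[i] + 1), substitution);
                best = Math.min(best, next[i]);
            }
            // Prune: if the whole row is over budget, no descendant can match.
            if (best <= maxDistance) {
                prefix.append(c);
                walk(entry.getValue(), next, query, maxDistance, prefix, results);
                prefix.setLength(prefix.length() - 1);
            }
        }
    }
}
```

With `maxDistance` set to 0 the pruning leaves only the single path that spells the query, which matches the O(n) case; with a very large `maxDistance` nothing is pruned and the whole dictionary is visited. The order of the returned words depends on `HashMap` iteration order, so sort the list if a stable order matters.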

情话喂你

2020-12-22 19:28
Here is an example of Levenshtein Automata in Java (EDIT: moved to GitHub). These will probably also be helpful:

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/

EDIT: The above links seem to have moved to GitHub:

https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/util/automaton

https://github.com/apache/lucene-solr/tree/master/lucene/core/src/test/org/apache/lucene/util/automaton

It looks like the experimental Lucene code is based on the dk.brics.automaton package.

Usage appears to be something similar to below:
```
LevenshteinAutomata builder = new LevenshteinAutomata(s);
Automaton automata = builder.toAutomaton(n);
boolean result1 = BasicOperations.run(automata, "foo");
boolean result2 = BasicOperations.run(automata, "bar");
```
独厮守ぢ

2020-12-22 19:31

If I understand it right, you want to loop over all branches of the trie. That's not that difficult with a recursive function. I use a trie in my k-nearest neighbor algorithm as well, with the same kind of function. I don't know Java, but here's some pseudocode:

function walk (testitem trie)
    make an empty array results
    function compare (testitem children distance)
        if testitem = None
            place the distance and children into results
        else
            compare(testitem from second position,
                    the sub-children of the first child in children,
                    if the first item of testitem is equal to that
                    of the node of the first child of children
                    add one to the distance (! non-destructive)
                    else just the distance)
            when there are any children left
                compare(testitem, the children without the first item,
                        distance)
    compare(testitem, children of root-node in trie, distance set to 0)
    return the results

Hope it helps.
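For readers who want the same idea in Java, here is a loose sketch of the recursive walk. It assumes the node-plus-children representation and records, for every trie path as long as the test item, how many positions differ. The pseudocode adds one when the characters are equal; this sketch adds one when they differ, on the assumption that a mismatch count is what is wanted. It is a per-position count, not a Levenshtein distance. All names (`TrieWalk`, `Trie`, `build`, `walk`, `compare`) are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names; a sketch of the recursive walk, not a definitive port.
class TrieWalk {
    // A trie is a node value plus a list of child tries.
    static final class Trie {
        final Character node; // null for the root
        final List<Trie> children = new ArrayList<>();

        Trie(Character node) {
            this.node = node;
        }

        // Returns the child holding c, creating it if needed.
        Trie child(char c) {
            for (Trie t : children) {
                if (t.node.charValue() == c) {
                    return t;
                }
            }
            Trie created = new Trie(c);
            children.add(created);
            return created;
        }
    }

    static Trie build(String... words) {
        Trie root = new Trie(null);
        for (String word : words) {
            Trie current = root;
            for (char c : word.toCharArray()) {
                current = current.child(c);
            }
        }
        return root;
    }

    // Maps each trie path of length item.length() to its mismatch count.
    static Map<String, Integer> walk(String item, Trie root) {
        Map<String, Integer> results = new LinkedHashMap<>();
        compare(item, 0, root.children, 0, new StringBuilder(), results);
        return results;
    }

    private static void compare(String item, int pos, List<Trie> children,
                                int distance, StringBuilder path,
                                Map<String, Integer> results) {
        if (pos == item.length()) {
            // Test item exhausted: record this path and its distance.
            results.put(path.toString(), distance);
            return;
        }
        // Visit every sibling in order, left to right.
        for (Trie child : children) {
            int step = item.charAt(pos) == child.node.charValue() ? 0 : 1;
            path.append(child.node.charValue());
            // distance + step is passed down; the caller's value is unchanged.
            compare(item, pos + 1, child.children, distance + step, path, results);
            path.setLength(path.length() - 1);
        }
    }
}
```

Calling `TrieWalk.walk("fob", TrieWalk.build("foo", "fun"))` visits both branches under "f" and records one entry per branch. Branches shorter than the test item are dropped without being recorded, as in the pseudocode.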


没有蜡笔的小新

2020-12-22 19:34

I'll just leave this here in case anyone is looking for yet another treatment of this problem:

http://code.google.com/p/oracleofwoodyallen/wiki/ApproximateStringMatching
