Implementing a simple Trie for efficient Levenshtein Distance calculation - Java


UPDATE 3

Done. Below is the code that finally passed all of my tests. Again, this is modeled after Murilo Vasconcelo's modified version of Steve

11 Answers

借酒劲吻你

2020-12-22 19:35
Correct me if I am wrong, but I believe your Update 3 has an extra loop that is unnecessary and makes the program much slower:
```
for (int i = 0; i < iWordLength; i++) {
    traverseTrie(theTrie.root, word.get(i), word, currentRow);
}
```
You ought to call traverseTrie only once because within traverseTrie you are already looping over the whole word. The code should be only as follows:
```
traverseTrie(theTrie.root, ' ', word, currentRow);
```
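As a rough illustration of the single-call idea, here is a minimal sketch that makes one descent through the trie and builds each DP row from the letter on the trie edge, not from a letter of the query word. The `TrieNode` and `TrieLevenshtein` classes and their method names are assumptions made for this example; they are not the OP's code. The sketch starts from the root's children instead of passing a placeholder letter for the root, which is a design choice of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal trie node; not the OP's class.
class TrieNode {
    final Map<Character, TrieNode> children = new HashMap<>();
    boolean isWord;
}

class TrieLevenshtein {
    final TrieNode root = new TrieNode();

    void insert(String s) {
        TrieNode node = root;
        for (char c : s.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.isWord = true;
    }

    // Smallest edit distance between `word` and any stored word
    // (Integer.MAX_VALUE if the trie is empty).
    int minDistance(String word) {
        int[] firstRow = new int[word.length() + 1];
        for (int i = 0; i <= word.length(); i++) {
            firstRow[i] = i; // distance from the empty prefix
        }
        int best = root.isWord ? word.length() : Integer.MAX_VALUE;
        // One descent per child of the root; no outer loop over the query's letters.
        for (Map.Entry<Character, TrieNode> e : root.children.entrySet()) {
            best = traverse(e.getValue(), e.getKey(), word, firstRow, best);
        }
        return best;
    }

    // Computes the DP row for `letter` (the label of the edge into `node`) and recurses.
    private int traverse(TrieNode node, char letter, String word, int[] previousRow, int best) {
        int size = previousRow.length;
        int[] currentRow = new int[size];
        currentRow[0] = previousRow[0] + 1;
        int rowMin = currentRow[0];
        for (int i = 1; i < size; i++) {
            int insertCost = currentRow[i - 1] + 1;
            int deleteCost = previousRow[i] + 1;
            int replaceCost = previousRow[i - 1] + (word.charAt(i - 1) == letter ? 0 : 1);
            currentRow[i] = Math.min(insertCost, Math.min(deleteCost, replaceCost));
            rowMin = Math.min(rowMin, currentRow[i]);
        }
        if (node.isWord) {
            best = Math.min(best, currentRow[size - 1]);
        }
        // Prune: the row minimum never decreases further down the trie.
        if (rowMin < best) {
            for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
                best = traverse(e.getValue(), e.getKey(), word, currentRow, best);
            }
        }
        return best;
    }
}
```

With `arb` and `area` inserted, `minDistance("arc")` evaluates to 1 in this sketch.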
猫巷女王i

2020-12-22 19:37

From what I can tell, you don't need to improve the efficiency of the Levenshtein distance computation itself; you need to store your strings in a structure that saves you from running distance computations so many times, i.e. by pruning the search space.

Since Levenshtein distance is a metric, you can use any of the metric space indices that take advantage of the triangle inequality. You mentioned BK-Trees, but there are others, e.g. Vantage Point Trees, Fixed-Queries Trees, Bisector Trees, and Spatial Approximation Trees. Here are their descriptions:

Burkhard-Keller Tree

Nodes are inserted into the tree as follows: for the root node, pick an arbitrary element from the space; add uniquely edge-labeled children such that the value of each edge is the distance from the pivot to that element; apply recursively, selecting the child as the pivot when an edge already exists.
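The insertion rule above, plus a range query that uses the triangle inequality, might look like the following minimal sketch. The `BKTree` class, its `insert`/`search`/`distance` methods, and the choice of the first inserted word as the root pivot are all assumptions made for illustration, not a reference implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BKTree {
    private static final class Node {
        final String word;
        // Edge label = distance from this node's word to the child's word.
        final Map<Integer, Node> children = new HashMap<>();
        Node(String word) { this.word = word; }
    }

    private Node root;

    void insert(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = distance(word, node.word);
            if (d == 0) return; // already stored
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child; // an edge with this label exists: the child becomes the pivot
        }
    }

    // All stored words within maxDistance of query.
    List<String> search(String query, int maxDistance) {
        List<String> result = new ArrayList<>();
        if (root != null) search(root, query, maxDistance, result);
        return result;
    }

    private void search(Node node, String query, int maxDistance, List<String> result) {
        int d = distance(query, node.word);
        if (d <= maxDistance) result.add(node.word);
        // Triangle inequality: matches can only sit under edges labeled in [d - max, d + max].
        for (int label = Math.max(1, d - maxDistance); label <= d + maxDistance; label++) {
            Node child = node.children.get(label);
            if (child != null) search(child, query, maxDistance, result);
        }
    }

    // Two-row Levenshtein distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.length()];
    }
}
```

Each query still computes one distance per visited node, but whole subtrees whose edge labels fall outside the allowed band are skipped without any computation.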

Fixed-Queries Tree

As with BKTs except: Elements are stored at leaves; Each leaf has multiple elements; For each level of the tree the same pivot is used.

Bisector Tree

Each node contains two pivot elements with their covering radius (maximum distance between the centre element and any of its subtree elements); filter into two sets those elements which are closest to the first pivot and those closest to the second, and recursively build two subtrees from these sets.

Spatial Approximation Tree

Initially all elements are in a bag; choose an arbitrary element to be the pivot; build a collection of nearest neighbours within range of the pivot; put each remaining element into the bag of the nearest element to it from the collection just built; recursively form a subtree from each element of this collection.

Vantage Point Tree

Choose a pivot from the set arbitrarily; calculate the median distance between this pivot and each element of the remaining set; filter elements from the set into left and right recursive subtrees such that those with distances less than or equal to the median form the left and those greater form the right.
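A vantage point tree built along these lines could be sketched as below. The `VPTree` class is hypothetical and generic over any integer-valued metric; to keep the example short it is exercised with absolute difference on integers, and for the use case in this thread one would pass a Levenshtein function as the metric instead. Taking the last element as the pivot is just one arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.ToIntBiFunction;

class VPTree<T> {
    private final class Node {
        final T pivot;
        final int median;       // median distance from pivot to the elements below it
        final Node left, right; // left: distance <= median, right: distance > median
        Node(T pivot, int median, Node left, Node right) {
            this.pivot = pivot; this.median = median; this.left = left; this.right = right;
        }
    }

    private final ToIntBiFunction<T, T> metric;
    private final Node root;

    VPTree(List<T> items, ToIntBiFunction<T, T> metric) {
        this.metric = metric;
        this.root = build(new ArrayList<>(items));
    }

    private Node build(List<T> items) {
        if (items.isEmpty()) return null;
        T pivot = items.remove(items.size() - 1); // arbitrary pivot: the last element
        if (items.isEmpty()) return new Node(pivot, 0, null, null);
        List<Integer> sorted = new ArrayList<>();
        for (T item : items) sorted.add(metric.applyAsInt(pivot, item));
        Collections.sort(sorted);
        int median = sorted.get((sorted.size() - 1) / 2);
        List<T> left = new ArrayList<>();
        List<T> right = new ArrayList<>();
        for (T item : items) {
            if (metric.applyAsInt(pivot, item) <= median) left.add(item);
            else right.add(item);
        }
        return new Node(pivot, median, build(left), build(right));
    }

    // All stored elements within `radius` of `query`.
    List<T> search(T query, int radius) {
        List<T> result = new ArrayList<>();
        search(root, query, radius, result);
        return result;
    }

    private void search(Node node, T query, int radius, List<T> result) {
        if (node == null) return;
        int d = metric.applyAsInt(query, node.pivot);
        if (d <= radius) result.add(node.pivot);
        // Triangle inequality decides which side can still contain a match.
        if (d - radius <= node.median) search(node.left, query, radius, result);
        if (d + radius > node.median) search(node.right, query, radius, result);
    }
}
```

For example, `new VPTree<Integer>(List.of(1, 5, 9, 12, 20, 21, 30), (a, b) -> Math.abs(a - b)).search(10, 2)` finds 9 and 12 (in some order).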


隐瞒了意图╮

2020-12-22 19:37

In many ways, Steve Hanov's algorithm (presented in the first article linked in the question, Fast and Easy Levenshtein distance using a Trie), the ports of the algorithm made by Murilo and you (OP), and quite possibly every pertinent algorithm involving a Trie or similar structure, function much like a Levenshtein Automaton (which has been mentioned several times here) does:

Given:
       dict is a dictionary represented as a DFA (ex. trie or dawg)
       dictState is a state in dict
       dictStartState is the start state in dict
       dictAcceptState is a dictState arrived at after following the transitions defined by a word in dict
       editDistance is an edit distance
       laWord is a word
       la is a Levenshtein Automaton defined for laWord and editDistance
       laState is a state in la
       laStartState is the start state in la
       laAcceptState is a laState arrived at after following the transitions defined by a word that is within editDistance of laWord
       charSequence is a sequence of chars
       traversalDataStack is a stack of (dictState, laState, charSequence) tuples
       resultSet is the set of words in dict found to be within editDistance of laWord

Define dictState as dictStartState
Define laState as laStartState
Push (dictState, laState, "") on to traversalDataStack
While traversalDataStack is not empty
    Define currentTraversalDataTuple as the product of a pop of traversalDataStack
    Define currentDictState as the dictState in currentTraversalDataTuple
    Define currentLAState as the laState in currentTraversalDataTuple
    Define currentCharSequence as the charSequence in currentTraversalDataTuple
    For each char in alphabet
        Check if currentDictState has outgoing transition labeled by char
        Check if currentLAState has outgoing transition labeled by char
        If both currentDictState and currentLAState have outgoing transitions labeled by char
            Define newDictState as the state arrived at after following the outgoing transition of currentDictState labeled by char
            Define newLAState as the state arrived at after following the outgoing transition of currentLAState labeled by char
            Define newCharSequence as concatenation of currentCharSequence and char
            Push (newDictState, newLAState, newCharSequence) on to traversalDataStack
            If newDictState is a dictAcceptState, and if newLAState is a laAcceptState
                Add newCharSequence to resultSet
            endIf
        endIf
    endFor
endWhile
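The traversal above can be sketched in Java roughly as follows. This is an illustration under explicit assumptions: the dictionary is a plain trie, and the Levenshtein Automaton state is stood in for by a dynamic-programming row, with a missing transition modeled as a row whose minimum exceeds `editDistance`. The `ProductTraversal`, `DictState`, `Tuple`, `buildTrie`, `step`, and `search` names are hypothetical. Iterating over the trie's outgoing edges replaces the "for each char in alphabet" loop.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class ProductTraversal {
    // dictState: a trie node (hypothetical minimal type for this sketch).
    static final class DictState {
        final Map<Character, DictState> next = new TreeMap<>();
        boolean accept;
    }

    // One (dictState, laState, charSequence) tuple. ASSUMPTION: laState is a DP row.
    static final class Tuple {
        final DictState dictState;
        final int[] laState;
        final String charSequence;
        Tuple(DictState dictState, int[] laState, String charSequence) {
            this.dictState = dictState;
            this.laState = laState;
            this.charSequence = charSequence;
        }
    }

    static DictState buildTrie(String... words) {
        DictState start = new DictState();
        for (String w : words) {
            DictState s = start;
            for (char c : w.toCharArray()) s = s.next.computeIfAbsent(c, k -> new DictState());
            s.accept = true;
        }
        return start;
    }

    // Follows the transition labeled c; returns null when there is none (dead state).
    static int[] step(int[] row, String laWord, int editDistance, char c) {
        int[] next = new int[row.length];
        next[0] = row[0] + 1;
        int min = next[0];
        for (int i = 1; i < row.length; i++) {
            int sub = row[i - 1] + (laWord.charAt(i - 1) == c ? 0 : 1);
            next[i] = Math.min(sub, Math.min(row[i] + 1, next[i - 1] + 1));
            min = Math.min(min, next[i]);
        }
        return min <= editDistance ? next : null;
    }

    static Set<String> search(DictState dictStartState, String laWord, int editDistance) {
        Set<String> resultSet = new TreeSet<>();
        int[] laStartState = new int[laWord.length() + 1];
        for (int i = 0; i < laStartState.length; i++) laStartState[i] = i;
        Deque<Tuple> traversalDataStack = new ArrayDeque<>();
        traversalDataStack.push(new Tuple(dictStartState, laStartState, ""));
        while (!traversalDataStack.isEmpty()) {
            Tuple current = traversalDataStack.pop();
            for (Map.Entry<Character, DictState> e : current.dictState.next.entrySet()) {
                int[] newLaState = step(current.laState, laWord, editDistance, e.getKey());
                if (newLaState == null) continue; // only one side has this transition
                String newCharSequence = current.charSequence + e.getKey();
                traversalDataStack.push(new Tuple(e.getValue(), newLaState, newCharSequence));
                // Accept when both automata are in an accepting state.
                if (e.getValue().accept && newLaState[laWord.length()] <= editDistance) {
                    resultSet.add(newCharSequence);
                }
            }
        }
        return resultSet;
    }
}
```

For a trie holding `arb`, `area`, `arc`, and `bar`, `search(trie, "arc", 1)` returns `arb` and `arc`. Like the pseudocode, the sketch never tests the start state itself, so an empty string stored in the dictionary would not be reported.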

Steve Hanov's algorithm and its aforementioned derivatives use a Levenshtein distance computation matrix in place of a formal Levenshtein Automaton. That is pretty fast, but a formal Levenshtein Automaton can have its parametric states (abstract states which describe the concrete states of the automaton) generated and used for traversal, bypassing any edit-distance-related runtime computation whatsoever. So it should run even faster than the aforementioned algorithms.

If you (or anybody else) are interested in a formal Levenshtein Automaton solution, have a look at LevenshteinAutomaton. It implements the aforementioned parametric-state-based algorithm, as well as a pure concrete-state-traversal-based algorithm (outlined above) and dynamic-programming-based algorithms (for both edit distance and neighbor determination). It's maintained by yours truly :) .


隐瞒了意图╮

2020-12-22 19:39

I've implemented the algorithm described in the "Fast and Easy Levenshtein distance using a Trie" article in C++ and it is really fast. If you want (and understand C++ better than Python), I can paste the code somewhere.

Edit: I posted it on my blog.

逝去的感伤

2020-12-22 19:44
I was looking at your latest Update 3, and the algorithm does not seem to work well for me.

Let's say you have the test case below:
```
Trie dict = new Trie();
dict.insert("arb");
dict.insert("area");

ArrayList<Character> word = new ArrayList<Character>();
word.add('a');
word.add('r');
word.add('c');
```
In this case, the minimum edit distance between "arc" and the dict should be 1, which is the edit distance between "arc" and "arb", but your algorithm will return 2 instead.

I went through the code below:
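The expected values can be double-checked independently of any trie with a plain full-matrix Levenshtein computation. The `EditDistanceCheck` class below is a hypothetical helper written only for this check, assuming unit cost for insertion, deletion, and substitution.

```java
class EditDistanceCheck {
    // Classic full-matrix Levenshtein distance (insert, delete, substitute all cost 1).
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                          Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("arc", "arb"));  // prints 1
        System.out.println(levenshtein("arc", "area")); // prints 2
    }
}
```

So the minimum over the two dictionary words is 1, reached by "arb".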
```
if (word.get(i - 1) == letter) {
    replaceCost = previousRow[i - 1];
} else {
    replaceCost = previousRow[i - 1] + 1;
}
```
At least for the first loop, the letter is one of the characters in the word, but you should instead be comparing against the nodes in the trie, so there will be one row duplicated for the first character in the word, is that right? Each DP matrix has its first row duplicated. I executed exactly the same code you posted in the solution.