Implementing a simple Trie for efficient Levenshtein Distance calculation - Java

不思量自难忘° 2020-12-22 18:51

UPDATE 3

Done. Below is the code that finally passed all of my tests. Again, this is modeled after Murilo Vasconcelo's modified version of Steve

11 answers
  • 2020-12-22 19:35

    Correct me if I am wrong, but I believe your Update 3 has an extra loop that is unnecessary and makes the program much slower:

    for (int i = 0; i < iWordLength; i++) {
        traverseTrie(theTrie.root, word.get(i), word, currentRow);
    }
    

    You ought to call traverseTrie only once, because traverseTrie already loops over the whole word internally. The code should be just:

    traverseTrie(theTrie.root, ' ', word, currentRow);
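
A minimal sketch of that single-pass structure, assuming a hypothetical `SketchTrie` class (this is not the OP's code, and the class and method names are made up for illustration). Because the root carries no letter here, the sketch starts one descent per child of the root instead of passing a placeholder character; the loop over the query word lives only inside `traverse`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal trie; nodes flag the end of an inserted word. */
class SketchTrie {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Smallest Levenshtein distance from query to any stored word. */
    int minDistance(String query) {
        // Row 0 of the DP matrix: distance from the empty prefix to each query prefix.
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        int best = root.isWord ? query.length() : Integer.MAX_VALUE;
        // One descent per outgoing edge of the root; no outer loop over the query.
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            best = traverse(e.getValue(), e.getKey(), query, firstRow, best);
        }
        return best;
    }

    private int traverse(Node node, char letter, String query, int[] previousRow, int best) {
        int columns = query.length() + 1;
        int[] currentRow = new int[columns];
        currentRow[0] = previousRow[0] + 1;
        int rowMin = currentRow[0];
        for (int i = 1; i < columns; i++) {
            int insertCost = currentRow[i - 1] + 1;
            int deleteCost = previousRow[i] + 1;
            int replaceCost = previousRow[i - 1] + (query.charAt(i - 1) == letter ? 0 : 1);
            currentRow[i] = Math.min(insertCost, Math.min(deleteCost, replaceCost));
            rowMin = Math.min(rowMin, currentRow[i]);
        }
        if (node.isWord) {
            best = Math.min(best, currentRow[columns - 1]);
        }
        // Row minima never decrease with depth, so skip branches that cannot beat best.
        if (rowMin < best) {
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                best = traverse(e.getValue(), e.getKey(), query, currentRow, best);
            }
        }
        return best;
    }
}
```

With "arb" and "area" inserted, `minDistance("arc")` returns 1 in this sketch.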
    
  • 2020-12-22 19:37

    From what I can tell, you don't need to improve the efficiency of the Levenshtein distance computation itself; you need to store your strings in a structure that prunes the search space, so that far fewer distance computations have to be run.

    Since Levenshtein distance is a metric, you can use any metric-space index that takes advantage of the triangle inequality. You mentioned BK-Trees, but there are others, e.g. Vantage Point Trees, Fixed-Queries Trees, Bisector Trees, and Spatial Approximation Trees. Here are their descriptions:

    Burkhard-Keller Tree

    Nodes are inserted into the tree as follows: for the root node, pick an arbitrary element from the space; add unique edge-labeled children such that the value of each edge is the distance from the pivot to that element; apply recursively, selecting the child as the pivot when an edge already exists.

    Fixed-Queries Tree

    As with BKTs except: Elements are stored at leaves; Each leaf has multiple elements; For each level of the tree the same pivot is used.

    Bisector Tree

    Each node contains two pivot elements with their covering radius (maximum distance between the centre element and any of its subtree elements); Filter into two sets those elements which are closest to the first pivot and those closest to the second, and recursively build two subtrees from these sets.

    Spatial Approximation Tree

    Initially all elements are in a bag; choose an arbitrary element to be the pivot; build a collection of nearest neighbours within range of the pivot; put each remaining element into the bag of the nearest element to it from the collection just built; recursively form a subtree from each element of this collection.

    Vantage Point Tree

    Choose a pivot from the set arbitrarily; calculate the median distance between this pivot and each element of the remaining set; filter elements from the set into left and right recursive subtrees such that those with distances less than or equal to the median form the left and those greater form the right.
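
As an illustration of the first structure in the list, a Burkhard-Keller tree over strings could be sketched as follows. This is a minimal sketch under stated assumptions: the class `BkTreeSketch` and its method names are hypothetical, the first word added becomes the root pivot, and a plain two-row Levenshtein computation serves as the metric. The search skips any child whose edge label differs from the query's distance to the pivot by more than the allowed radius, which is where the triangle inequality does the pruning.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical BK-tree sketch keyed by Levenshtein distance. */
class BkTreeSketch {
    private static final class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>(); // edge label = distance to this pivot
        Node(String word) { this.word = word; }
    }

    private Node root;

    void add(String word) {
        if (root == null) {
            root = new Node(word);
            return;
        }
        Node node = root;
        while (true) {
            int d = distance(word, node.word);
            if (d == 0) {
                return; // word already stored
            }
            Node child = node.children.get(d);
            if (child == null) {
                node.children.put(d, new Node(word));
                return;
            }
            node = child; // an edge with this label exists: the child becomes the pivot
        }
    }

    List<String> search(String query, int maxDistance) {
        List<String> matches = new ArrayList<>();
        if (root != null) {
            collect(root, query, maxDistance, matches);
        }
        return matches;
    }

    private void collect(Node node, String query, int maxDistance, List<String> matches) {
        int d = distance(query, node.word);
        if (d <= maxDistance) {
            matches.add(node.word);
        }
        // Triangle inequality: a match below edge e requires |e - d| <= maxDistance.
        for (Map.Entry<Integer, Node> e : node.children.entrySet()) {
            if (Math.abs(e.getKey() - d) <= maxDistance) {
                collect(e.getValue(), query, maxDistance, matches);
            }
        }
    }

    /** Plain two-row Levenshtein distance. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.length()];
    }
}
```

The order of the returned matches depends on the tree shape, so callers that need a stable order should sort the result.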

  • 2020-12-22 19:37

    In many ways, Steve Hanov's algorithm (presented in the first article linked in the question, Fast and Easy Levenshtein distance using a Trie), the ports of the algorithm made by Murilo and you (OP), and quite possibly every pertinent algorithm involving a Trie or similar structure, function much like a Levenshtein Automaton (which has been mentioned several times here) does:

    Given:
           dict is a dictionary represented as a DFA (ex. trie or dawg)
           dictState is a state in dict
           dictStartState is the start state in dict
           dictAcceptState is a dictState arrived at after following the transitions defined by a word in dict
           editDistance is an edit distance
           laWord is a word
           la is a Levenshtein Automaton defined for laWord and editDistance
           laState is a state in la
           laStartState is the start state in la
           laAcceptState is a laState arrived at after following the transitions defined by a word that is within editDistance of laWord
           charSequence is a sequence of chars
           traversalDataStack is a stack of (dictState, laState, charSequence) tuples
           resultSet is the set of charSequences accepted by both dict and la
    
    Define dictState as dictStartState
    Define laState as laStartState
    Push (dictState, laState, "") on to traversalDataStack
    While traversalDataStack is not empty
        Define currentTraversalDataTuple as the product of a pop of traversalDataStack
        Define currentDictState as the dictState in currentTraversalDataTuple
        Define currentLAState as the laState in currentTraversalDataTuple
        Define currentCharSequence as the charSequence in currentTraversalDataTuple
        For each char in alphabet
            Check if currentDictState has outgoing transition labeled by char
            Check if currentLAState has outgoing transition labeled by char
            If both currentDictState and currentLAState have outgoing transitions labeled by char
                Define newDictState as the state arrived at after following the outgoing transition of currentDictState labeled by char
                Define newLAState as the state arrived at after following the outgoing transition of currentLAState labeled by char
                Define newCharSequence as concatenation of currentCharSequence and char
                Push (newDictState, newLAState, newCharSequence) on to traversalDataStack
                If newDictState is a dictAcceptState, and if newLAState is a laAcceptState
                    Add newCharSequence to resultSet
                endIf
            endIf
        endFor
    endWhile
    

    Steve Hanov's algorithm and its aforementioned derivatives use a Levenshtein distance computation matrix in place of a formal Levenshtein Automaton. That is pretty fast, but a formal Levenshtein Automaton can have its parametric states (abstract states which describe the concrete states of the automaton) generated and used for traversal, bypassing any edit-distance-related runtime computation whatsoever. So it should run even faster than the aforementioned algorithms.

    If you (or anybody else) are interested in a formal Levenshtein Automaton solution, have a look at LevenshteinAutomaton. It implements the aforementioned parametric-state-based algorithm, as well as a pure concrete-state-traversal-based algorithm (outlined above) and dynamic-programming-based algorithms (for both edit distance and neighbor determination). It's maintained by yours truly :) .
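
To make the correspondence concrete, here is a rough Java sketch of the stack-based traversal in which a row of the distance matrix stands in for the automaton state. This is the matrix-based variant, not a formal Levenshtein Automaton with precomputed parametric states. All names (`RowAutomatonSketch`, `step`, `Frame`) are hypothetical. A `null` return from `step` plays the role of "no outgoing transition", and iterating the dictionary state's own edges replaces the scan over the whole alphabet.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: trie as dictionary DFA, DP row as stand-in for the LA state. */
class RowAutomatonSketch {
    static final class DictState {
        final Map<Character, DictState> next = new TreeMap<>();
        boolean accept;
    }

    private static final class Frame {
        final DictState dictState;
        final int[] laState;
        final String chars;
        Frame(DictState dictState, int[] laState, String chars) {
            this.dictState = dictState;
            this.laState = laState;
            this.chars = chars;
        }
    }

    final DictState dictStartState = new DictState();

    void insert(String word) {
        DictState state = dictStartState;
        for (char c : word.toCharArray()) {
            state = state.next.computeIfAbsent(c, k -> new DictState());
        }
        state.accept = true;
    }

    /** Simulated LA transition: the next DP row, or null once no entry is within editDistance. */
    static int[] step(int[] row, char c, String laWord, int editDistance) {
        int[] next = new int[row.length];
        next[0] = row[0] + 1;
        int min = next[0];
        for (int i = 1; i < row.length; i++) {
            int sub = row[i - 1] + (laWord.charAt(i - 1) == c ? 0 : 1);
            next[i] = Math.min(sub, Math.min(row[i] + 1, next[i - 1] + 1));
            min = Math.min(min, next[i]);
        }
        return min <= editDistance ? next : null;
    }

    List<String> search(String laWord, int editDistance) {
        List<String> resultSet = new ArrayList<>();
        int[] laStartState = new int[laWord.length() + 1];
        for (int i = 0; i < laStartState.length; i++) {
            laStartState[i] = i;
        }
        Deque<Frame> traversalDataStack = new ArrayDeque<>();
        traversalDataStack.push(new Frame(dictStartState, laStartState, ""));
        while (!traversalDataStack.isEmpty()) {
            Frame current = traversalDataStack.pop();
            for (Map.Entry<Character, DictState> e : current.dictState.next.entrySet()) {
                int[] newLaState = step(current.laState, e.getKey(), laWord, editDistance);
                if (newLaState == null) {
                    continue; // the simulated automaton has no transition on this char
                }
                String newChars = current.chars + e.getKey();
                traversalDataStack.push(new Frame(e.getValue(), newLaState, newChars));
                // Accept when the dictionary accepts and the full-word entry is within range.
                if (e.getValue().accept && newLaState[laWord.length()] <= editDistance) {
                    resultSet.add(newChars);
                }
            }
        }
        return resultSet;
    }
}
```

Each transition here still costs time proportional to the word length, which is exactly the runtime work a precomputed formal automaton would avoid.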

  • 2020-12-22 19:39

    I've implemented the algorithm described in the "Fast and Easy Levenshtein distance using a Trie" article in C++, and it is really fast. If you want (i.e. if you understand C++ better than Python), I can paste the code somewhere.

    Edit: I posted it on my blog.

  • 2020-12-22 19:44

    I was looking at your latest Update 3, and the algorithm does not seem to work well for me.

    Let's say you have the test case below:

        Trie dict = new Trie();
        dict.insert("arb");
        dict.insert("area");
    
        ArrayList<Character> word = new ArrayList<Character>();
        word.add('a');
        word.add('r');
        word.add('c');
    

    In this case, the minimum edit distance between "arc" and the dict should be 1, which is the edit distance between "arc" and "arb", but your algorithm returns 2 instead.

    I went through the code below:

            if (word.get(i - 1) == letter) {
                replaceCost = previousRow[i - 1];
            } else {
                replaceCost = previousRow[i - 1] + 1;
            }
    

    At least in the first loop, the letter is one of the characters of the word itself, whereas you should be comparing against the letters of the nodes in the trie. As a result, there is one extra row that duplicates the first character of the word, is that right? Each DP matrix then has its first row as a duplicate. I executed exactly the same code you posted in the solution.
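
For comparison, the rows one would expect when each row is driven by a trie node's letter can be traced with a small helper. This is only a sketch: `RowTrace` and `rowsForPath` are hypothetical names, and the trie path is passed in as a plain string rather than walked through real nodes. It reuses the same insert/delete/replace recurrence as the snippet above.

```java
import java.util.Arrays;

/** Hypothetical helper that traces the DP rows along one trie path. */
class RowTrace {
    /** Returns the rows built while walking triePath letter by letter against word. */
    static int[][] rowsForPath(String triePath, String word) {
        int columns = word.length() + 1;
        int[][] rows = new int[triePath.length() + 1][columns];
        for (int i = 0; i < columns; i++) {
            rows[0][i] = i; // row for the trie root (empty prefix)
        }
        for (int depth = 1; depth <= triePath.length(); depth++) {
            char letter = triePath.charAt(depth - 1); // the trie node's letter, not a letter of word
            int[] previousRow = rows[depth - 1];
            int[] currentRow = rows[depth];
            currentRow[0] = previousRow[0] + 1;
            for (int i = 1; i < columns; i++) {
                int insertCost = currentRow[i - 1] + 1;
                int deleteCost = previousRow[i] + 1;
                int replaceCost = previousRow[i - 1] + (word.charAt(i - 1) == letter ? 0 : 1);
                currentRow[i] = Math.min(insertCost, Math.min(deleteCost, replaceCost));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (int[] row : rowsForPath("arb", "arc")) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

For the path "arb" against the word "arc" this prints [0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1] and [3, 2, 1, 1], one row per line, so the last entry of the final row gives the expected distance of 1. The same trace for "area" ends in [4, 3, 2, 2], a distance of 2.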
