Implementing a simple Trie for efficient Levenshtein Distance calculation - Java

不思量自难忘° 2020-12-22 18:51

UPDATE 3

Done. Below is the code that finally passed all of my tests. Again, this is modeled after Murilo Vasconcelo's modified version of Steve

11 Answers
  • 2020-12-22 19:20

    The function walk takes a testitem (for example an indexable string, or an array of characters) and a trie. A trie can be an object with two slots: one holding the node of the trie, the other holding the children of that node. The children are tries as well. In Python it would be something like:

    class Trie(object):
        def __init__(self, node=None, children=None):
            self.node = node
            # Avoid a shared mutable default: each trie gets its own list.
            self.children = children if children is not None else []
    

    Or in Lisp...

    (defstruct trie (node nil) (children nil))
    

    Now a trie looks something like this:

    (trie #node None
          #children ((trie #node f
                           #children ((trie #node o
                                            #children ((trie #node o
                                                             #children None)))
                                      (trie #node u
                                            #children ((trie #node n
                                                             #children None)))))))
    

    Now the internal function (which you can also write separately) takes the testitem, the children of the root node of the trie (whose node value is None or whatever), and an initial distance set to 0.

    Then we just recursively traverse both branches of the trie, starting left and then right.

  • 2020-12-22 19:21

    My intuition tells me that each TrieNode should store the String it represents and also references to letters of the alphabet, not necessarily all letters. Is my intuition correct?

    No, a trie doesn't represent a String; it represents a set of strings (and all their prefixes). A trie node maps an input character to another trie node. So it should hold something like an array of characters and a corresponding array of TrieNode references. (Maybe not that exact representation, depending on efficiency in your particular use of it.)
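
As a rough illustration of the point above, here is a minimal Java sketch in which each node maps a character to a child node. The names `SimpleTrie`, `Node`, `endOfWord`, `insert`, `contains`, and `hasPrefix` are made up for this example, and a `HashMap` is just one possible representation of the character-to-node mapping:

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal trie sketch: each node maps a character to a child node. */
class SimpleTrie {
    static final class Node {
        // Only the letters that actually occur get an entry.
        final Map<Character, Node> children = new HashMap<>();
        // True if the path from the root to this node spells a stored word.
        boolean endOfWord;
    }

    final Node root = new Node();

    /** Adds a word by creating any missing nodes along its path. */
    void insert(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.endOfWord = true;
    }

    /** True only if the exact word was inserted. */
    boolean contains(String word) {
        Node node = find(word);
        return node != null && node.endOfWord;
    }

    /** True if some stored word starts with the given prefix. */
    boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    // Follows the characters of s from the root; null if the path breaks off.
    private Node find(String s) {
        Node current = root;
        for (char c : s.toCharArray()) {
            current = current.children.get(c);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

After inserting "foo" and "fun", `contains("fo")` is false but `hasPrefix("fo")` is true, which is the sense in which the trie also represents every prefix.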

  • 2020-12-22 19:27

    Well, here's how I did it a long time ago. I stored the dictionary as a trie, which is simply a finite-state machine restricted to the form of a tree. You can enhance it by not making that restriction. For example, common suffixes can simply be a shared subtree. You could even have loops, to capture stuff like "nation", "national", "nationalize", "nationalization", ...

    Keep the trie as absolutely simple as possible. Don't go stuffing strings in it.

    Remember, you don't do this to find the distance between two given strings. You use it to find the strings in the dictionary that are closest to one given string. The time it takes depends on how much Levenshtein distance you can tolerate. For distance zero, it is simply O(n), where n is the word length. As the tolerated distance grows, more of the trie has to be visited, and for arbitrary distance it is O(N), where N is the number of words in the dictionary.
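
To make the lookup concrete, here is a hedged Java sketch of one common way to do it: walk the trie depth-first while carrying one row of the usual Levenshtein dynamic-programming table per node, and prune a branch as soon as every entry in its row exceeds the tolerated distance. The class `FuzzyTrie` and its methods are hypothetical names for this example, not code from any answer here. The nodes store no strings; the current prefix is rebuilt in a `StringBuilder` during the walk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: find dictionary words within a maximum edit distance. */
class FuzzyTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.endOfWord = true;
    }

    /** Returns all stored words whose distance to target is at most maxDistance. */
    List<String> search(String target, int maxDistance) {
        List<String> results = new ArrayList<>();
        // Row for the empty prefix: turning "" into target[0..j) costs j insertions.
        int[] firstRow = new int[target.length() + 1];
        for (int j = 0; j < firstRow.length; j++) {
            firstRow[j] = j;
        }
        visit(root, firstRow, target, maxDistance, new StringBuilder(), results);
        return results;
    }

    // row[j] is the edit distance between the current prefix and target[0..j).
    private void visit(Node node, int[] row, String target, int maxDistance,
                       StringBuilder prefix, List<String> results) {
        int last = target.length();
        if (node.endOfWord && row[last] <= maxDistance) {
            results.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            char c = entry.getKey();
            int[] next = new int[last + 1];
            next[0] = row[0] + 1;
            int smallest = next[0];
            for (int j = 1; j <= last; j++) {
                int substitution = row[j - 1] + (target.charAt(j - 1) == c ? 0 : 1);
                int insertion = next[j - 1] + 1;
                int deletion = row[j] + 1;
                next[j] = Math.min(substitution, Math.min(insertion, deletion));
                smallest = Math.min(smallest, next[j]);
            }
            // The row minimum never decreases further down, so prune when it is too large.
            if (smallest <= maxDistance) {
                prefix.append(c);
                visit(entry.getValue(), next, target, maxDistance, prefix, results);
                prefix.setLength(prefix.length() - 1);
            }
        }
    }
}
```

With a tolerance of zero only the exact path survives the pruning; a larger tolerance lets more branches through, which is why the cost grows toward the size of the dictionary.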

  • 2020-12-22 19:28

    Here is an example of Levenshtein Automata in Java (EDIT: moved to GitHub). These will probably also be helpful:

    http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
    http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/

    EDIT: The above links seem to have moved to GitHub:

    https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/util/automaton
    https://github.com/apache/lucene-solr/tree/master/lucene/core/src/test/org/apache/lucene/util/automaton

    It looks like the experimental Lucene code is based off of the dk.brics.automaton package.

    Usage appears to be something similar to below:

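    // s is the reference word; n is the maximum edit distance to accept.
    LevenshteinAutomata builder = new LevenshteinAutomata(s);
    Automaton automata = builder.toAutomaton(n);
    // Each run checks whether the candidate is within n edits of s.
    boolean result1 = BasicOperations.run(automata, "foo");
    boolean result2 = BasicOperations.run(automata, "bar");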
    
  • 2020-12-22 19:31

    If I understand correctly, you want to loop over all branches of the trie. That's not difficult with a recursive function. I use a trie in my k-nearest-neighbor algorithm as well, with the same kind of function. I don't know Java, but here's some pseudocode:

    function walk (testitem trie)
       make an empty array results
       function compare (testitem children distance)
          if testitem is empty
             place the distance and children into results
          else when there are any children left
             ; descend into the first child, consuming one item of testitem
             compare(testitem from second position,
                     the sub-children of the first child in children,
                     if the first item of testitem differs from
                     the node of the first child of children
                        the distance plus one (! non-destructive)
                     else just the distance)
             ; then try the remaining siblings against the same testitem
             compare(testitem, the children without the first item,
                     distance)
       compare(testitem, children of root-node in trie, distance set to 0)
       return the results
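
Since the question is about Java, here is one possible translation of the walk, offered as a sketch rather than a definitive version. It assumes that a mismatch between the test item and a node adds one to the distance, so it counts position-by-position mismatches along each branch (not a full Levenshtein distance). The recursion over the remaining siblings is replaced by a loop, and all names (`TrieWalk`, `Node`, `insert`, `walk`, `compare`) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the recursive walk; all names are illustrative. */
class TrieWalk {
    static final class Node {
        // TreeMap keeps children in character order, so results are deterministic.
        final Map<Character, Node> children = new TreeMap<>();
    }

    static void insert(Node root, String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
    }

    /** Maps every trie path as long as testItem to its number of mismatches. */
    static Map<String, Integer> walk(Node root, String testItem) {
        Map<String, Integer> results = new LinkedHashMap<>();
        compare(root, testItem, 0, 0, new StringBuilder(), results);
        return results;
    }

    private static void compare(Node node, String testItem, int pos, int distance,
                                StringBuilder path, Map<String, Integer> results) {
        if (pos == testItem.length()) {
            // The test item is used up: record the path taken and its distance.
            results.put(path.toString(), distance);
            return;
        }
        // Branches that end before the test item is used up record nothing.
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            char c = entry.getKey();
            int step = (c == testItem.charAt(pos)) ? 0 : 1;
            path.append(c);
            // distance + step creates a new value, so sibling branches are unaffected.
            compare(entry.getValue(), testItem, pos + 1, distance + step, path, results);
            path.setLength(path.length() - 1);
        }
    }
}
```

For a trie holding "foo" and "fun", walking with the test item "fox" records "foo" with distance 1 and "fun" with distance 2.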
    

    Hope it helps.

  • 2020-12-22 19:34

    I'll just leave this here in case anyone is looking for yet another treatment of this problem:

    http://code.google.com/p/oracleofwoodyallen/wiki/ApproximateStringMatching
