Levenshtein DFA in .NET

广开言路 2021-02-04 18:43

Good afternoon,

Does anyone know of an "out-of-the-box" implementation of a Levenshtein DFA (deterministic finite automaton) in .NET (or easily translatable to i

6 answers
  • 2021-02-04 19:03

    I'd just like to point out that as of now, the Levenshtein Automaton implementations in both Lucene and Lucene.Net make use of files containing parametric state tables (tables of abstract states which describe the concrete states in an automaton) created using Moman.

    If you want a solution capable of constructing such tables from scratch in memory, you might want to have a look at LevenshteinAutomaton. It's in Java, but it is well-structured, easy to follow, and extensively commented, and as such should be easier to port to C# than the current Lucene implementation. It is also maintained by moi.

    * Fun fact: I submitted LevenshteinAutomaton as a replacement, or as a reference for a replacement, for the current Levenshtein Automaton implementation in Lucene... 3 years ago.

  • 2021-02-04 19:07

    Here you go.

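    using System;

    public static class StringExtensions
    {
        /// <summary>
        /// Computes the Levenshtein (edit) distance between two strings.
        /// </summary>
        public static int DistanceFrom(this string s, string t)
        {
            int n = s.Length;
            int m = t.Length;

            // Step 1: if either string is empty, the distance is the other's length.
            if (n == 0)
                return m;

            if (m == 0)
                return n;

            // Allocate the table only after the early returns.
            int[,] d = new int[n + 1, m + 1];

            // Step 2: initialize the first column and the first row.
            for (int i = 0; i <= n; i++) d[i, 0] = i;
            for (int j = 0; j <= m; j++) d[0, j] = j;

            // Step 3: walk over every character of s
            for (int i = 1; i <= n; i++)
            {
                // Step 4: walk over every character of t
                for (int j = 1; j <= m; j++)
                {
                    // Step 5: a substitution costs nothing when the characters match
                    int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                    // Step 6: cheapest of deletion, insertion, and substitution
                    d[i, j] = Math.Min(
                        Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                        d[i - 1, j - 1] + cost);
                }
            }

            // Step 7: the bottom-right cell holds the distance
            return d[n, m];
        }
    }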
    
  • 2021-02-04 19:08

    Nick Johnson has a very detailed blog post about the construction of a Levenshtein automaton in Python, and the code is here. It is a good read, and I have used a slightly modified version of the code that I found efficient.

    Mike Dunlavey's answer is good too. I wonder which is more efficient in this case: a trie search or a Levenshtein DFA?
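
    For readers who want to experiment before porting anything, here is a minimal C# sketch of one common way to simulate a Levenshtein automaton: each state is a row of the edit-distance table, clamped at maxEdits + 1 so that the number of distinct states stays finite. The class name RowLevenshteinAutomaton and its members are hypothetical; this is not the code from the linked post, and it steps through states on the fly rather than building a full DFA up front.

    ```csharp
    using System;

    // Hypothetical helper (not from the linked post): simulates a Levenshtein
    // automaton for a fixed term, using clamped edit-distance rows as states.
    public sealed class RowLevenshteinAutomaton
    {
        private readonly string term;
        private readonly int maxEdits;

        public RowLevenshteinAutomaton(string term, int maxEdits)
        {
            this.term = term;
            this.maxEdits = maxEdits;
        }

        // Start state: distance from the empty input to each prefix of the term.
        public int[] Start()
        {
            var row = new int[term.Length + 1];
            for (int i = 0; i <= term.Length; i++) row[i] = Math.Min(i, maxEdits + 1);
            return row;
        }

        // Transition: consume one input character and compute the next row.
        public int[] Step(int[] state, char c)
        {
            int cap = maxEdits + 1; // values above maxEdits are all equivalent
            var next = new int[state.Length];
            next[0] = Math.Min(state[0] + 1, cap);
            for (int i = 1; i < state.Length; i++)
            {
                int cost = term[i - 1] == c ? 0 : 1;
                int best = Math.Min(Math.Min(next[i - 1] + 1, state[i] + 1), state[i - 1] + cost);
                next[i] = Math.Min(best, cap);
            }
            return next;
        }

        // Accepting state: the whole term is within maxEdits of the input so far.
        public bool IsMatch(int[] state) => state[state.Length - 1] <= maxEdits;

        // Live state: some continuation of the input could still be accepted.
        public bool CanMatch(int[] state)
        {
            foreach (int d in state)
                if (d <= maxEdits) return true;
            return false;
        }

        public bool Accepts(string input)
        {
            int[] state = Start();
            foreach (char c in input)
            {
                state = Step(state, c);
                if (!CanMatch(state)) return false; // dead state, stop early
            }
            return IsMatch(state);
        }
    }
    ```

    Usage would look like new RowLevenshteinAutomaton("food", 1).Accepts("wood"). Because the rows are clamped, the reachable states could be enumerated and memoized into an explicit transition table if a precomputed DFA is wanted.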

  • 2021-02-04 19:14

    We implemented this for Apache Lucene (Java); perhaps you could convert it to C# and save yourself some time.

    The main class is here. It's just a builder that produces Levenshtein DFAs from a string, using the Schulz and Mihov algorithm:

    http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/LevenshteinAutomata.java

    The parametric descriptions (the precomputed tables) for Lev1 and Lev2 are here: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev1ParametricDescription.java

    http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev2ParametricDescription.java

    You might notice that these files are computer-generated. We generated them with this script, using Jean-Philippe Barrette's great moman implementation (Python): http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/createLevAutomata.py

    We generate the parametric descriptions as packed long[] arrays so that they don't make our jar file too large.
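
    To illustrate the general idea of packing a table into a long[] (this is a sketch under my own assumptions, not Lucene's actual layout or API), the hypothetical PackedTable helper below stores small non-negative integers with a fixed number of bits per value and reads them back by index. The names Pack, Unpack, and bitsPerValue are made up for the example.

    ```csharp
    using System;

    // Hypothetical helper: names and layout are illustrative, not Lucene's.
    public static class PackedTable
    {
        // Packs non-negative values into 64-bit words, bitsPerValue bits each.
        public static long[] Pack(int[] values, int bitsPerValue)
        {
            if (bitsPerValue < 1 || bitsPerValue > 31)
                throw new ArgumentOutOfRangeException(nameof(bitsPerValue));

            long totalBits = (long)values.Length * bitsPerValue;
            var data = new long[(totalBits + 63) / 64];

            for (int i = 0; i < values.Length; i++)
            {
                long v = values[i];
                if (v < 0 || v >= (1L << bitsPerValue))
                    throw new ArgumentOutOfRangeException(nameof(values));

                long bitLoc = (long)i * bitsPerValue;
                int word = (int)(bitLoc >> 6);    // which 64-bit word
                int offset = (int)(bitLoc & 63);  // bit position inside it

                data[word] |= v << offset;
                if (offset + bitsPerValue > 64)   // value straddles two words
                    data[word + 1] |= v >> (64 - offset);
            }
            return data;
        }

        // Reads the value stored at the given index.
        public static int Unpack(long[] data, int index, int bitsPerValue)
        {
            long bitLoc = (long)index * bitsPerValue;
            int word = (int)(bitLoc >> 6);
            int offset = (int)(bitLoc & 63);

            ulong bits = (ulong)data[word] >> offset;
            if (offset + bitsPerValue > 64)       // pull the rest from the next word
                bits |= (ulong)data[word + 1] << (64 - offset);

            return (int)(bits & ((1UL << bitsPerValue) - 1));
        }
    }
    ```

    With 5 bits per value, for instance, 30 table entries fit into three longs instead of thirty ints, which is the kind of saving that matters when the tables are compiled into a class file.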

    Just modify toAutomaton(int n) to fit your needs or your DFA package. In our case we use a modified form of the brics automaton package, in which transitions are represented as Unicode code point ranges.

    Efficient unit tests are difficult for this sort of thing, but here is what we came up with. It seems to be thorough, and it even found a bug in the moman implementation (which the author fixed immediately!).

    http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/TestLevenshteinAutomata.java

  • 2021-02-04 19:18

    I ported the relevant Lucene Java code suggested by Robert Muir to C#. As for whether it is "out of the box": it is a work in progress, but the code appears¹ to work and performs very well, although it can probably be optimized² further.

    You can find it here: https://github.com/mjvh80/LevenshteinDFA/ .

    UPDATE: It appears that Lucene.NET is not in fact dead (yet?) and I noticed they now have a ported version of this code too. I would thus recommend looking there (https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Core/Util/Automaton/LevenshteinAutomata.cs) for an implementation of this.


    ¹ The code needs more tests.
    ² Perhaps because it is Java ported to C#, and because I wrote naive replacements for some classes (e.g. BitSet).

  • 2021-02-04 19:23

    I understand you want to find near matches in a big dictionary. Here's the way I do it. link.

    From what I'm able to figure out about DFA, I can't see how it's any better, or even actually any different, under the skin. NFAs might be faster, but that's because they don't exist. Maybe I'm wrong.
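
    To make the comparison concrete, here is a rough C# sketch of a dictionary search that walks a trie and carries one edit-distance row per node, pruning any branch whose row has no entry within the budget. It is not the method behind the link above; FuzzyTrie and its members are hypothetical names. The row plays the same role as an automaton state, which is one way to see why the two approaches can look so similar under the skin.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Hypothetical sketch: a dictionary trie searched with bounded edit distance.
    public sealed class FuzzyTrie
    {
        private sealed class Node
        {
            public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
            public string Word; // non-null only where a dictionary word ends
        }

        private readonly Node root = new Node();

        public void Add(string word)
        {
            Node node = root;
            foreach (char c in word)
            {
                if (!node.Children.TryGetValue(c, out Node child))
                {
                    child = new Node();
                    node.Children[c] = child;
                }
                node = child;
            }
            node.Word = word;
        }

        // Returns every stored word within maxEdits of the query.
        public List<string> Search(string query, int maxEdits)
        {
            var results = new List<string>();
            var row = new int[query.Length + 1];
            for (int i = 0; i <= query.Length; i++) row[i] = i;
            Walk(root, row, query, maxEdits, results);
            return results;
        }

        private static void Walk(Node node, int[] row, string query, int maxEdits, List<string> results)
        {
            if (node.Word != null && row[query.Length] <= maxEdits)
                results.Add(node.Word);

            foreach (KeyValuePair<char, Node> edge in node.Children)
            {
                // Extend the path by one character and compute the next row.
                var next = new int[row.Length];
                next[0] = row[0] + 1;
                int best = next[0];
                for (int i = 1; i < row.Length; i++)
                {
                    int cost = query[i - 1] == edge.Key ? 0 : 1;
                    next[i] = Math.Min(Math.Min(next[i - 1] + 1, row[i] + 1), row[i - 1] + cost);
                    best = Math.Min(best, next[i]);
                }

                // Prune: row minima never decrease, so this subtree cannot match.
                if (best <= maxEdits)
                    Walk(edge.Value, next, query, maxEdits, results);
            }
        }
    }
    ```

    The order of the results depends on dictionary enumeration order, so callers that need a stable order should sort the returned list.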
