Modifying Levenshtein Distance algorithm to not calculate all distances

渐次进展 2020-12-31 19:40

I'm working on a fuzzy search implementation and as part of the implementation, we're using Apache's StringUtils.getLevenshteinDistance. At the moment, we're going for a

6 answers
  • 2020-12-31 19:53

    The issue with implementing the window is dealing with the value to the left of the first entry and above the last entry in each row.

    One way is to start the values you initially fill in at 1 instead of 0, then just ignore any 0s that you encounter. You'll have to subtract 1 from your final answer.

    Another way is to fill the entries left of first and above last with high values so the minimum check will never pick them. That's the way I chose when I had to implement it the other day:

    import java.util.Arrays;
    
    public static int levenshtein(String s, String t, int threshold) {
        int slen = s.length();
        int tlen = t.length();
    
        // swap so the smaller string is t; this reduces the memory usage
        // of our buffers
        if(tlen > slen) {
            String stmp = s;
            s = t;
            t = stmp;
            int itmp = slen;
            slen = tlen;
            tlen = itmp;
        }
    
        // p is the previous and d is the current distance array; dtmp is used in swaps
        int[] p = new int[tlen + 1];
        int[] d = new int[tlen + 1];
        int[] dtmp;
    
        // the values necessary for our threshold are written; the ones after
        // must be filled with large integers since the tailing member of the threshold 
        // window in the bottom array will run min across them
        int n = 0;
        for(; n < Math.min(p.length, threshold + 1); ++n)
            p[n] = n;
        Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
        Arrays.fill(d, Integer.MAX_VALUE);
    
        // this is the core of the Levenshtein edit distance algorithm
        // instead of actually building the matrix, two arrays are swapped back and forth
        // the threshold limits the amount of entries that need to be computed if we're 
        // looking for a match within a set distance
        for(int row = 1; row < s.length()+1; ++row) {
            char schar = s.charAt(row-1);
            d[0] = row;
    
            // set up our threshold window
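            int min = Math.max(1, row - threshold);
            int max = Math.min(d.length, row + threshold + 1);
    
            // if the window has moved past the end of t, the lengths differ by more
            // than the threshold, so no match is possible (this also keeps
            // d[min-1] below from indexing past the end of the array)
            if(min >= max)
                return -1;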
    
            // since we're reusing arrays, we need to be sure to wipe the value left of the
            // starting index; we don't have to worry about the value above the ending index
            // as the arrays were initially filled with large integers and we progress to the right
            if(min > 1)
                d[min-1] = Integer.MAX_VALUE;
    
            for(int col = min; col < max; ++col) {
                if(schar == t.charAt(col-1))
                    d[col] = p[col-1];
                else 
                    // min of: diagonal, left, up
                    d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
            }
            // swap our arrays
            dtmp = p;
            p = d;
            d = dtmp;
        }
    
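        // values above the threshold are not guaranteed to be the true distance
        // (this also covers the untouched Integer.MAX_VALUE case)
        if(p[tlen] > threshold)
            return -1;
        return p[tlen];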
    }
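
The first option mentioned in this answer (store every value offset by one, so that 0 can mean "not computed") is not shown in the code above. A minimal sketch of how it might look follows; the class name `OffsetBandLevenshtein`, the variable names and the early length check are assumptions made for illustration, not the answerer's code.

```java
final class OffsetBandLevenshtein {

    /**
     * Bounded edit distance: returns the distance if it is at most
     * threshold, otherwise -1. Every stored cell holds (distance + 1),
     * so the default array value 0 can mean "not computed".
     */
    static int distance(String s, String t, int threshold) {
        int n = s.length();
        int m = t.length();
        // the length difference alone is a lower bound on the distance
        if (threshold < 0 || Math.abs(n - m) > threshold) {
            return -1;
        }

        int[] prev = new int[m + 1]; // row i - 1, zero-initialised
        int[] cur = new int[m + 1];  // row i, zero-initialised

        // row 0: distance j is stored as j + 1, inside the window only
        for (int j = 0; j <= Math.min(m, threshold); j++) {
            prev[j] = j + 1;
        }

        for (int i = 1; i <= n; i++) {
            int lo = Math.max(0, i - threshold);
            // written this way so that i + threshold cannot overflow
            int hi = (threshold >= m - i) ? m : i + threshold;

            // the cell left of the window still holds a value from two rows ago
            if (lo > 0) {
                cur[lo - 1] = 0;
            }
            for (int j = lo; j <= hi; j++) {
                if (j == 0) {
                    cur[0] = i + 1; // first column: i deletions, offset by one
                    continue;
                }
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int best = prev[j - 1] + cost;      // the diagonal is always inside the window
                if (prev[j] != 0) {                 // above: ignore a 0
                    best = Math.min(best, prev[j] + 1);
                }
                if (cur[j - 1] != 0) {              // left: ignore a 0
                    best = Math.min(best, cur[j - 1] + 1);
                }
                cur[j] = best;
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }

        int result = prev[m] - 1; // subtract the offset from the final answer
        return (prev[m] != 0 && result <= threshold) ? result : -1;
    }
}
```

Compared with the sentinel approach above, this version needs no `Arrays.fill` with large values, at the price of an extra zero check on two of the three neighbours.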
    
  • 2020-12-31 19:53

    I used the original code and placed this just before the end of the j for loop:

        if (p[n] > s.length() + 5)
            break;
    

    The +5 is arbitrary, but for our purposes, if the distance is the query length plus five (or whatever number we settle upon), it doesn't really matter what is returned because we consider the match as simply being too different. It does cut down on things a bit. Still, I'm pretty sure this isn't the idea that the Wiki statement was talking about, if anyone understands that better.
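
A related cutoff, which is a different check from the `p[n]` test above and is shown here only as a sketch, is to stop as soon as the smallest value in the current row exceeds the limit. Values never decrease when moving from a cell to any cell in the next row, so the final distance cannot be smaller than the row minimum. The class name `EarlyExitLevenshtein` and the parameter `limit` are assumptions made for illustration.

```java
final class EarlyExitLevenshtein {

    /** Returns the edit distance if it is at most limit, otherwise -1. */
    static int distance(String s, String t, int limit) {
        int n = s.length();
        int m = t.length();
        int[] prev = new int[n + 1]; // previous row of the cost table
        int[] cur = new int[n + 1];  // current row of the cost table

        for (int i = 0; i <= n; i++) {
            prev[i] = i;
        }

        for (int j = 1; j <= m; j++) {
            char tj = t.charAt(j - 1);
            cur[0] = j;
            int rowMin = cur[0];
            for (int i = 1; i <= n; i++) {
                int cost = s.charAt(i - 1) == tj ? 0 : 1;
                cur[i] = Math.min(Math.min(cur[i - 1] + 1, prev[i] + 1), prev[i - 1] + cost);
                rowMin = Math.min(rowMin, cur[i]);
            }
            // every path to the final cell passes through this row,
            // so the result can no longer be within the limit
            if (rowMin > limit) {
                return -1;
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[n] <= limit ? prev[n] : -1;
    }
}
```

This still fills whole rows, so it only saves work on pairs that are clearly too different; it does not give the O(km) bound of the windowed approach.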

  • 2020-12-31 19:57

    I've written before about Levenshtein automata, which are one way to do this sort of check in O(n) time, here. The source code samples are in Python, but the explanations should be helpful, and the referenced papers provide more details.

  • 2020-12-31 20:02

    Apache Commons Lang 3.4 has this implementation:

    /**
     * <p>Find the Levenshtein distance between two Strings if it's less than or equal to a given
     * threshold.</p>
     *
     * <p>This is the number of changes needed to change one String into
     * another, where each change is a single character modification (deletion,
     * insertion or substitution).</p>
     *
     * <p>This implementation follows from Algorithms on Strings, Trees and Sequences by Dan Gusfield
     * and Chas Emerick's implementation of the Levenshtein distance algorithm from
     * <a href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.htm</a></p>
     *
     * <pre>
     * StringUtils.getLevenshteinDistance(null, *, *)             = IllegalArgumentException
     * StringUtils.getLevenshteinDistance(*, null, *)             = IllegalArgumentException
     * StringUtils.getLevenshteinDistance(*, *, -1)               = IllegalArgumentException
     * StringUtils.getLevenshteinDistance("","", 0)               = 0
     * StringUtils.getLevenshteinDistance("aaapppp", "", 8)       = 7
     * StringUtils.getLevenshteinDistance("aaapppp", "", 7)       = 7
     * StringUtils.getLevenshteinDistance("aaapppp", "", 6)       = -1
     * StringUtils.getLevenshteinDistance("elephant", "hippo", 7) = 7
     * StringUtils.getLevenshteinDistance("elephant", "hippo", 6) = -1
     * StringUtils.getLevenshteinDistance("hippo", "elephant", 7) = 7
     * StringUtils.getLevenshteinDistance("hippo", "elephant", 6) = -1
     * </pre>
     *
     * @param s  the first String, must not be null
     * @param t  the second String, must not be null
     * @param threshold the target threshold, must not be negative
     * @return result distance, or {@code -1} if the distance would be greater than the threshold
     * @throws IllegalArgumentException if either String input {@code null} or negative threshold
     */
    public static int getLevenshteinDistance(CharSequence s, CharSequence t, final int threshold) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }
        if (threshold < 0) {
            throw new IllegalArgumentException("Threshold must not be negative");
        }
    
        /*
        This implementation only computes the distance if it's less than or equal to the
        threshold value, returning -1 if it's greater.  The advantage is performance: unbounded
        distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only
        computing a diagonal stripe of width 2k + 1 of the cost table.
        It is also possible to use this to compute the unbounded Levenshtein distance by starting
        the threshold at 1 and doubling each time until the distance is found; this is O(dm), where
        d is the distance.
    
        One subtlety comes from needing to ignore entries on the border of our stripe
        eg.
        p[] = |#|#|#|*
        d[] =  *|#|#|#|
        We must ignore the entry to the left of the leftmost member
        We must ignore the entry above the rightmost member
    
        Another subtlety comes from our stripe running off the matrix if the strings aren't
        of the same size.  Since string s is always swapped to be the shorter of the two,
        the stripe will always run off to the upper right instead of the lower left of the matrix.
    
        As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1.
        In this case we're going to walk a stripe of length 3.  The matrix would look like so:
    
           1 2 3 4 5
        1 |#|#| | | |
        2 |#|#|#| | |
        3 | |#|#|#| |
        4 | | |#|#|#|
        5 | | | |#|#|
        6 | | | | |#|
        7 | | | | | |
    
        Note how the stripe leads off the table as there is no possible way to turn a string of length 5
        into one of length 7 in edit distance of 1.
    
        Additionally, this implementation decreases memory usage by using two
        single-dimensional arrays and swapping them back and forth instead of allocating
        an entire n by m matrix.  This requires a few minor changes, such as immediately returning
        when it's detected that the stripe has run off the matrix and initially filling the arrays with
        large values so that entries we don't compute are ignored.
    
        See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.
         */
    
        int n = s.length(); // length of s
        int m = t.length(); // length of t
    
        // if one string is empty, the edit distance is necessarily the length of the other
        if (n == 0) {
            return m <= threshold ? m : -1;
        } else if (m == 0) {
            return n <= threshold ? n : -1;
        }
    
        if (n > m) {
            // swap the two strings to consume less memory
            final CharSequence tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }
    
        int p[] = new int[n + 1]; // 'previous' cost array, horizontally
        int d[] = new int[n + 1]; // cost array, horizontally
        int _d[]; // placeholder to assist in swapping p and d
    
        // fill in starting table values
        final int boundary = Math.min(n, threshold) + 1;
        for (int i = 0; i < boundary; i++) {
            p[i] = i;
        }
        // these fills ensure that the value above the rightmost entry of our
        // stripe will be ignored in following loop iterations
        Arrays.fill(p, boundary, p.length, Integer.MAX_VALUE);
        Arrays.fill(d, Integer.MAX_VALUE);
    
        // iterates through t
        for (int j = 1; j <= m; j++) {
            final char t_j = t.charAt(j - 1); // jth character of t
            d[0] = j;
    
            // compute stripe indices, constrain to array size
            final int min = Math.max(1, j - threshold);
            final int max = (j > Integer.MAX_VALUE - threshold) ? n : Math.min(n, j + threshold);
    
            // the stripe may lead off of the table if s and t are of different sizes
            if (min > max) {
                return -1;
            }
    
            // ignore entry left of leftmost
            if (min > 1) {
                d[min - 1] = Integer.MAX_VALUE;
            }
    
            // iterates through [min, max] in s
            for (int i = min; i <= max; i++) {
                if (s.charAt(i - 1) == t_j) {
                    // diagonally left and up
                    d[i] = p[i - 1];
                } else {
                    // 1 + minimum of cell to the left, to the top, diagonally left and up
                    d[i] = 1 + Math.min(Math.min(d[i - 1], p[i]), p[i - 1]);
                }
            }
    
            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }
    
        // if p[n] is greater than the threshold, there's no guarantee on it being the correct
        // distance
        if (p[n] <= threshold) {
            return p[n];
        }
        return -1;
    }
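
The comment in the method above notes that a bounded distance can also be used to compute the unbounded one by starting the threshold at 1 and doubling it until a distance is found. A minimal sketch of that driver follows. The names `DoublingLevenshtein`, `BoundedDistance` and `fullTableBounded` are hypothetical; the full-table stand-in is only there so the example runs on its own, and in practice the bounded method above would be passed in instead.

```java
final class DoublingLevenshtein {

    /** Any bounded distance: returns the distance if <= threshold, else -1. */
    interface BoundedDistance {
        int apply(CharSequence s, CharSequence t, int threshold);
    }

    /** Starts the threshold at 1 and doubles it until the bounded call succeeds. */
    static int unbounded(CharSequence s, CharSequence t, BoundedDistance bounded) {
        // the distance can never exceed the length of the longer string
        int longest = Math.max(s.length(), t.length());
        int threshold = 1;
        while (true) {
            int d = bounded.apply(s, t, threshold);
            if (d >= 0) {
                return d;
            }
            if (threshold >= longest) {
                throw new IllegalStateException("bounded distance rejected a sufficient threshold");
            }
            // double, capped at the longest length (also avoids int overflow)
            threshold = (threshold > longest / 2) ? longest : threshold * 2;
        }
    }

    /** Stand-in bounded distance that fills the whole table (assumption, for the demo only). */
    static int fullTableBounded(CharSequence s, CharSequence t, int threshold) {
        int n = s.length();
        int m = t.length();
        int[] prev = new int[n + 1];
        int[] cur = new int[n + 1];
        for (int i = 0; i <= n; i++) {
            prev[i] = i;
        }
        for (int j = 1; j <= m; j++) {
            cur[0] = j;
            for (int i = 1; i <= n; i++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                cur[i] = Math.min(Math.min(cur[i - 1] + 1, prev[i] + 1), prev[i - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[n] <= threshold ? prev[n] : -1;
    }
}
```

With Commons Lang on the classpath, the call would presumably look like `DoublingLevenshtein.unbounded(s, t, StringUtils::getLevenshteinDistance)`, since the three-argument overload matches the interface.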
    
  • 2020-12-31 20:15

    According to "Gusfield, Dan (1997). Algorithms on strings, trees, and sequences: computer science and computational biology" (page 264) you should ignore zeros.

  • 2020-12-31 20:15

    Someone answers a very similar question here:

    Quote:
    I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes, that I use to prune the tree. With that routine in hand, first I run it with k=0, then k=1, then k=2 until I either get a hit or I don't want to go any higher.

    #include <stdbool.h>  /* bool, true, false */
    #include <string.h>   /* strlen */
    
    char* a = /* string 1 */;
    char* b = /* string 2 */;
    int na = strlen(a);
    int nb = strlen(b);
    bool walk(int ia, int ib, int k){
      /* if the budget is exhausted, prune the search */
      if (k < 0) return false;
      /* if at end of both strings we have a match */ 
      if (ia == na && ib == nb) return true;
      /* if the first characters match, continue walking with no reduction in budget */
      if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
      /* if the first characters don't match, assume there is a 1-character replacement */
      if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
      /* try assuming there is an extra character in a */
      if (ia < na && walk(ia+1, ib, k-1)) return true;
      /* try assuming there is an extra character in b */
      if (ib < nb && walk(ia, ib+1, k-1)) return true;
      /* if none of those worked, I give up */
      return false;
    }  
    

    This is just the main part; there is more code in the original.
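
The quoted routine only answers "is the distance at most k?", and the driver loop that tries k=0, then k=1, then k=2 is described but not shown. A sketch of both in Java follows. The class name `BudgetedWalk`, the method `smallestBudget` and the parameter `maxK` are assumptions made for illustration, not part of the quoted answer.

```java
final class BudgetedWalk {
    private final String a;
    private final String b;

    BudgetedWalk(String a, String b) {
        this.a = a;
        this.b = b;
    }

    /** True if a[ia..] can be turned into b[ib..] with at most k edits. */
    boolean walk(int ia, int ib, int k) {
        // budget exhausted: prune this branch
        if (k < 0) return false;
        // reached the end of both strings: match
        if (ia == a.length() && ib == b.length()) return true;
        boolean both = ia < a.length() && ib < b.length();
        // characters match: continue with the same budget
        if (both && a.charAt(ia) == b.charAt(ib) && walk(ia + 1, ib + 1, k)) return true;
        // characters differ: assume a one-character replacement
        if (both && a.charAt(ia) != b.charAt(ib) && walk(ia + 1, ib + 1, k - 1)) return true;
        // assume an extra character in a
        if (ia < a.length() && walk(ia + 1, ib, k - 1)) return true;
        // assume an extra character in b
        if (ib < b.length() && walk(ia, ib + 1, k - 1)) return true;
        return false;
    }

    /** Tries k = 0, 1, 2, ... up to maxK; returns the first k that matches, or -1. */
    static int smallestBudget(String a, String b, int maxK) {
        BudgetedWalk w = new BudgetedWalk(a, b);
        for (int k = 0; k <= maxK; k++) {
            if (w.walk(0, 0, k)) {
                return k;
            }
        }
        return -1;
    }
}
```

Because the first budget that succeeds is returned, the result equals the edit distance whenever it is at most `maxK`. The search is exponential in k, so this only suits small budgets.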
