Problems with Levenshtein algorithm in Java

情深已故 2021-02-08 20:16

I want to use the Levenshtein algorithm for the following task: if a user on my website searches for some value (they type characters into an input), I want to instantly check for

5 answers
  •  花落未央
    2021-02-08 20:55

    1) A few words on improving the Levenshtein distance algorithm

    A naive recursive implementation of Levenshtein distance has exponential complexity.
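
    For context, the naive recursion typically looks like the sketch below. It is an assumption about the usual textbook form, not the asker's code, and the class and method names are made up for illustration. Each call branches into three more calls, and the same (i, j) pairs are recomputed over and over, which is where the exponential running time comes from.

```java
// Illustrative only: the naive recursion (hypothetical class and method names).
class NaiveLevenshtein {

    // Distance between the first i chars of s1 and the first j chars of s2.
    static int dist(String s1, int i, String s2, int j) {
        if (i == 0) return j; // s1 prefix is empty: insert the j remaining chars
        if (j == 0) return i; // s2 prefix is empty: delete the i remaining chars

        int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;

        int deletion = dist(s1, i - 1, s2, j) + 1;
        int insertion = dist(s1, i, s2, j - 1) + 1;
        int substitution = dist(s1, i - 1, s2, j - 1) + cost;

        return Math.min(Math.min(deletion, insertion), substitution);
    }

    static int levenshteinDistance(String s1, String s2) {
        return dist(s1, s1.length(), s2, s2.length());
    }

    public static void main(String[] args) {
        System.out.println(levenshteinDistance("kitten", "sitting")); // prints 3
    }
}
```

    This is fine for very short strings, but the number of calls grows so quickly that it becomes unusable for instant search over many keywords.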

    I'd suggest using memoization (a table of already computed sub-results) and implementing Levenshtein distance without recursion, which reduces the complexity to O(N^2) time (and needs O(N^2) memory):

    public static int levenshteinDistance( String s1, String s2 ) {
        return dist( s1.toCharArray(), s2.toCharArray() );
    }
    
    public static int dist( char[] s1, char[] s2 ) {
    
        // distance matrix - to memoize distances between substrings
        // needed to avoid recursion
        int[][] d = new int[ s1.length + 1 ][ s2.length + 1 ];
    
        // d[i][j] will hold the distance between the prefixes
        // s1.substring(0, i) and s2.substring(0, j)
    
        for( int i = 0; i < s1.length + 1; i++ ) {
            d[ i ][ 0 ] = i;
        }
    
        for(int j = 0; j < s2.length + 1; j++) {
            d[ 0 ][ j ] = j;
        }
    
        for( int i = 1; i < s1.length + 1; i++ ) {
            for( int j = 1; j < s2.length + 1; j++ ) {
                int d1 = d[ i - 1 ][ j ] + 1;
                int d2 = d[ i ][ j - 1 ] + 1;
                int d3 = d[ i - 1 ][ j - 1 ];
                if ( s1[ i - 1 ] != s2[ j - 1 ] ) {
                    d3 += 1;
                }
                d[ i ][ j ] = Math.min( Math.min( d1, d2 ), d3 );
            }
        }
        return d[ s1.length ][ s2.length ];
    }
    

    Or, even better: notice that each cell of the distance matrix depends only on the current and the previous row, so you can reduce the memory requirement to O(N):

    public static int dist( char[] s1, char[] s2 ) {
    
        // memoize only previous line of distance matrix     
        int[] prev = new int[ s2.length + 1 ];
    
        for( int j = 0; j < s2.length + 1; j++ ) {
            prev[ j ] = j;
        }
    
        for( int i = 1; i < s1.length + 1; i++ ) {
    
            // calculate current line of distance matrix     
            int[] curr = new int[ s2.length + 1 ];
            curr[0] = i;
    
            for( int j = 1; j < s2.length + 1; j++ ) {
                int d1 = prev[ j ] + 1;
                int d2 = curr[ j - 1 ] + 1;
                int d3 = prev[ j - 1 ];
                if ( s1[ i - 1 ] != s2[ j - 1 ] ) {
                    d3 += 1;
                }
                curr[ j ] = Math.min( Math.min( d1, d2 ), d3 );
            }
    
            // define current line of distance matrix as previous     
            prev = curr;
        }
        return prev[ s2.length ];
    }
    

    2) A few words about autocomplete

    Levenshtein distance works well only when the query is expected to match a whole keyword almost exactly.

    But what if your keyword is apple and the user types green apples? The Levenshtein distance between the query and the keyword is large (7). And the Levenshtein distance between apple and bcdfghk (a nonsense string) is 7 too!
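
    A quick way to check those two numbers, as a minimal self-contained sketch (the class name DistanceDemo is hypothetical; dist uses the same two-row approach as above, but takes Strings):

```java
// Hypothetical demo class: compares a relevant and an irrelevant keyword.
class DistanceDemo {

    // Two-row Levenshtein distance, O(length of b) memory.
    static int dist(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            // the current row becomes the previous row for the next iteration
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(dist("green apples", "apple")); // prints 7
        System.out.println(dist("apple", "bcdfghk"));      // prints 7
    }
}
```

    So by this measure alone, a query that contains the keyword and a string that shares no letters with it look equally far away.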

    I'd suggest using a full-text search engine (e.g. Lucene). The trick is to represent each keyword with an n-gram model.

    In a few words:
    1) Represent each keyword as a document that contains its n-grams: apple -> [ap, pp, pl, le].

    2) After transforming each keyword into a set of n-grams, index each keyword-document by n-gram in your search engine. You need to build an index like this:

    ...
    ap -> apple, map, happy ...
    pp -> apple ...
    pl -> apple, place ...
    ...
    

    3) Now you have an n-gram index. When you get a query, split it into n-grams as well, which gives you the set of n-grams of the user's query. Then all you need is to retrieve the documents from your search engine that share the most n-grams with it. As a first draft, this is enough.

    4) For better suggestions, you can rank the search engine's results by Levenshtein distance.
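
    Steps 1) to 4) can be sketched without a search engine, using a plain in-memory map as the n-gram index. This is only an illustration under stated assumptions: the class NGramSuggest and its methods are hypothetical names, bigrams (n = 2) are used as in the example, candidates are ordered by the number of n-grams they share with the query, and Levenshtein distance is used only as a tie-breaker. A real setup would delegate indexing and scoring to Lucene or a similar engine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a search engine's n-gram index.
class NGramSuggest {

    private static final int N = 2; // bigrams, as in the example

    // n-gram -> keywords containing it, e.g. "pl" -> [apple, place]
    private final Map<String, Set<String>> index = new HashMap<>();

    // Step 1: split a string into its n-grams: apple -> [ap, pp, pl, le]
    static Set<String> ngrams(String s) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + N <= s.length(); i++) {
            grams.add(s.substring(i, i + N));
        }
        return grams;
    }

    // Step 2: index a keyword under each of its n-grams
    void add(String keyword) {
        for (String gram : ngrams(keyword)) {
            index.computeIfAbsent(gram, g -> new TreeSet<>()).add(keyword);
        }
    }

    // Steps 3 and 4: collect keywords sharing n-grams with the query, then rank
    List<String> suggest(String query, int limit) {
        Map<String, Integer> shared = new HashMap<>();
        for (String gram : ngrams(query)) {
            for (String keyword : index.getOrDefault(gram, Collections.emptySet())) {
                shared.merge(keyword, 1, Integer::sum);
            }
        }

        List<String> result = new ArrayList<>(shared.keySet());
        result.sort((a, b) -> {
            // more shared n-grams first
            int byShared = Integer.compare(shared.get(b), shared.get(a));
            if (byShared != 0) return byShared;
            // then smaller Levenshtein distance to the query
            int byDist = Integer.compare(dist(query, a), dist(query, b));
            if (byDist != 0) return byDist;
            return a.compareTo(b); // stable, deterministic order
        });
        return result.subList(0, Math.min(limit, result.size()));
    }

    // Two-row Levenshtein distance
    static int dist(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        NGramSuggest suggester = new NGramSuggest();
        for (String keyword : new String[] {"apple", "map", "happy", "place"}) {
            suggester.add(keyword);
        }
        System.out.println(suggester.suggest("green apples", 2)); // [apple, happy]
    }
}
```

    With the keywords apple, map, happy and place indexed, the query green apples shares four bigrams with apple and two with happy, so those two come first even though the plain Levenshtein distance to apple is 7.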

    P.S. I'd suggest looking through the book "Introduction to Information Retrieval".
