Improving search results using Levenshtein distance in Java

南方客 2021-01-31 03:08

I have the following working Java code that searches for a word in a list of words, and it works as expected:

public class Levenshtein {
    privat         


        
5 Answers
  •  时光取名叫无心
    2021-01-31 04:09

    Without understanding the meaning of the words, as @DrYap suggests, the next logical unit for comparing two words (if you are not looking for misspellings) is the syllable. Modifying Levenshtein to compare syllables instead of characters is easy; the hard part is breaking the words into syllables. TeXHyphenator-J is a Java implementation that can be used to split the words. Based on this hyphenation library, here is a modified version of the Levenshtein function written by Michael Gilleland & Chas Emerick. More about syllable detection here and here. Of course, you will probably want to avoid syllable comparison of two single-syllable words and handle that case with standard Levenshtein.

    import net.davidashen.text.Hyphenator;
    
    public class WordDistance {
    
        public static void main(String args[]) throws Exception {
            Hyphenator h = new Hyphenator();
            h.loadTable(WordDistance.class.getResourceAsStream("hyphen.tex"));
            // Print the result; the original call discarded the return value
            System.out.println(getSyllableLevenshteinDistance(h, args[0], args[1]));
        }
    
        /**
         * Calculate the Syllable Levenshtein distance between two words.
         *
         * The Syllable Levenshtein distance is defined as the minimal number of
         * case-insensitive syllables you have to replace, insert or delete to
         * transform s into t.
         *
         * @return int
         * @throws IllegalArgumentException if either s or t is null
         */
        public static int getSyllableLevenshteinDistance(Hyphenator h, String s, String t) {
            if (s == null || t == null)
                throw new IllegalArgumentException("Strings must not be null");

            // The hyphenator marks syllable breaks with a soft hyphen (U+00AD)
            final String hyphen = Character.toString((char) 173);
            final String[] ss = h.hyphenate(s).split(hyphen);
            final String[] st = h.hyphenate(t).split(hyphen);

            final int n = ss.length;
            final int m = st.length;

            if (n == 0)
                return m;
            else if (m == 0)
                return n;

            int p[] = new int[n + 1]; // 'previous' cost array, horizontally
            int d[] = new int[n + 1]; // cost array, horizontally

            for (int i = 0; i <= n; i++)
                p[i] = i;

            for (int j = 1; j <= m; j++) {
                d[0] = j;
                for (int i = 1; i <= n; i++) {
                    int cost = ss[i - 1].equalsIgnoreCase(st[j - 1]) ? 0 : 1;
                    // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                    d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
                }
                // copy current distance counts to 'previous row' distance counts
                int[] _d = p;
                p = d;
                d = _d;
            }

            // our last action in the above loop was to switch d and p,
            // so p now actually has the most recent cost counts
            return p[n];
        }
    }
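The single-syllable fallback mentioned in the answer could look roughly like the sketch below. It is only an illustration, not part of the original answer: the class name `FallbackDistance` and its three methods are hypothetical, and the words are assumed to have been split into syllables already (for example by a hyphenator), so the sketch has no external dependency.

```java
import java.util.Locale;

public class FallbackDistance {

    /** Standard character-level Levenshtein distance, case-insensitive. */
    static int charDistance(String s, String t) {
        String a = s.toLowerCase(Locale.ROOT);
        String b = t.toLowerCase(Locale.ROOT);
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // swap rows
        }
        return prev[b.length()];
    }

    /** Levenshtein distance over syllables instead of characters. */
    static int syllableDistance(String[] ss, String[] st) {
        int[] prev = new int[st.length + 1];
        int[] curr = new int[st.length + 1];
        for (int j = 0; j <= st.length; j++) prev[j] = j;
        for (int i = 1; i <= ss.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= st.length; j++) {
                int cost = ss[i - 1].equalsIgnoreCase(st[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // swap rows
        }
        return prev[st.length];
    }

    /**
     * Hypothetical wrapper: when both words have at most one syllable,
     * compare characters; otherwise compare syllables.
     */
    static int distance(String[] ss, String[] st) {
        if (ss.length <= 1 && st.length <= 1) {
            return charDistance(String.join("", ss), String.join("", st));
        }
        return syllableDistance(ss, st);
    }

    public static void main(String[] args) {
        // Both single-syllable: falls back to character comparison
        System.out.println(distance(new String[] {"cat"}, new String[] {"cut"}));
        // Multi-syllable: compared syllable by syllable
        System.out.println(distance(new String[] {"ta", "ble"}, new String[] {"ca", "ble"}));
    }
}
```

Note that the two scales are not directly comparable: a distance of 1 means one character in the fallback case and one whole syllable otherwise, so results may need normalizing before they are ranked together.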
