Improving search results using Levenshtein distance in Java

南方客 2021-01-31 03:08

I have the following working Java code for searching for a word against a list of words, and it works as expected:

public class Levenshtein {
    privat         


        
5 answers
  • 2021-01-31 03:52

    Since you asked, I'll show what the UMBC semantic network can do at this kind of thing. Not sure it's what you really want:

    import static java.lang.String.format;
    import static java.util.stream.Collectors.toMap;
    import static java.util.function.Function.identity;
    
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Arrays;
    import java.util.regex.Pattern;
    
    public class SemanticSimilarity {
      private static final String GET_URL_FORMAT
          = "http://swoogle.umbc.edu/SimService/GetSimilarity?"
              + "operation=api&phrase1=%s&phrase2=%s";
      private static final Pattern VALID_WORD_PATTERN = Pattern.compile("\\w+");
      private static final String[] DICT = {
        "cat",
        "building",
        "girl",
        "ranch",
        "drawing",
        "wool",
        "gear",
        "question",
        "information",
        "tank" 
      };
    
      public static String httpGetLine(String urlToRead) throws IOException {
        URL url = new URL(urlToRead);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(conn.getInputStream()))) {
          return reader.readLine();
        }
      }
    
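      // Returns the service's similarity score for two single words,
      // or -1.0 if the request fails or the response is not a number.
      public static double getSimilarity(String a, String b) {
        // Only plain word characters are accepted, so the words can go into the URL unencoded.
        if (!VALID_WORD_PATTERN.matcher(a).matches()
            || !VALID_WORD_PATTERN.matcher(b).matches()) {
          throw new RuntimeException("Bad word");
        }
        try {
          return Double.parseDouble(httpGetLine(format(GET_URL_FORMAT, a, b)));
        } catch (IOException | NumberFormatException ex) {
          return -1.0;
        }
      }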
    
      public static void test(String target) throws IOException {
        System.out.println("Target: " + target);
        Arrays.stream(DICT)
            .collect(toMap(identity(), word -> getSimilarity(target, word)))
            .entrySet().stream()
            .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
            .forEach(System.out::println);
        System.out.println();
      }
    
      public static void main(String[] args) throws Exception {
        test("sheep");
        test("vehicle");
        test("house");
        test("data");
        test("girlfriend");
      }
    }
    

    The results are kind of fascinating:

    Target: sheep
    ranch=0.38563728
    cat=0.37816614
    wool=0.36558008
    question=0.047607
    girl=0.0388761
    information=0.027191084
    drawing=0.0039623436
    tank=0.0
    building=0.0
    gear=0.0
    
    Target: vehicle
    tank=0.65860236
    gear=0.2673374
    building=0.20197356
    cat=0.06057514
    information=0.041832563
    ranch=0.017701812
    question=0.017145569
    girl=0.010708235
    wool=0.0
    drawing=0.0
    
    Target: house
    building=1.0
    ranch=0.104496084
    tank=0.103863
    wool=0.059761923
    girl=0.056549154
    drawing=0.04310725
    cat=0.0418914
    gear=0.026439993
    information=0.020329408
    question=0.0012588014
    
    Target: data
    information=0.9924584
    question=0.03476312
    gear=0.029112043
    wool=0.019744944
    tank=0.014537057
    drawing=0.013742204
    ranch=0.0
    cat=0.0
    girl=0.0
    building=0.0
    
    Target: girlfriend
    girl=0.70060706
    ranch=0.11062875
    cat=0.09766617
    gear=0.04835723
    information=0.02449007
    wool=0.0
    question=0.0
    drawing=0.0
    tank=0.0
    building=0.0
    
  • 2021-01-31 03:55

    This really is an open-ended question, but I would suggest an alternative approach that uses, for example, the Smith-Waterman algorithm, as described in this SO post.
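    As a rough illustration of the Smith-Waterman idea (local alignment: find the best-scoring pair of substrings instead of aligning the whole words), here is a minimal sketch. The class name and the scoring values (match +2, mismatch -1, gap -1) are assumptions chosen for the example, not taken from the linked post.

```java
// Minimal Smith-Waterman sketch: returns the best local alignment score.
// Class name and scoring parameters are illustrative assumptions.
class SmithWaterman {
    static int score(String a, String b, int match, int mismatch, int gap) {
        // h[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = h[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                int up = h[i - 1][j] + gap;    // gap in b
                int left = h[i][j - 1] + gap;  // gap in a
                // Scores never drop below 0, which lets an alignment restart anywhere.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(score("job", "jobless", 2, -1, -1));
        System.out.println(score("job", "bob", 2, -1, -1));
    }
}
```

    Unlike an edit distance, a higher score means a better match, so candidates would be sorted in descending order.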

    Another (more light-weight) solution would be to use other distance/similarity metrics from NLP (e.g., Cosine similarity or Damerau–Levenshtein distance).
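    To make the Damerau-Levenshtein suggestion concrete, here is a small sketch of its restricted variant (often called optimal string alignment), which extends Levenshtein with swaps of two adjacent characters. The class and method names are made up for this example.

```java
// Optimal string alignment distance: Levenshtein plus adjacent transpositions.
// This is the restricted Damerau-Levenshtein variant; names are illustrative.
class DamerauLevenshtein {
    static int distance(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i; // delete all of a's prefix
        for (int j = 0; j <= m; j++) d[0][j] = j; // insert all of b's prefix
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                // Two adjacent characters swapped: count it as a single edit.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        // "jbo" is "job" with two letters swapped: one edit here, two with plain Levenshtein.
        System.out.println(distance("job", "jbo"));
    }
}
```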

  • 2021-01-31 03:56

    I tried the suggestion from the comments about sorting the matches by the distance returned by the Levenshtein algorithm, and it does seem to produce better results.

    (As I could not find the Searcher class from your code, I took the liberty of using a different word list, Levenshtein implementation, and language.)

    Using the word list provided in Ubuntu and the Levenshtein implementation from https://github.com/ztane/python-Levenshtein, I created a small script that asks for a word and prints all the closest words with their distances as tuples.

    Code - https://gist.github.com/atdaemon/9f59ad886c35024bdd28

    # Python 3
    from Levenshtein import distance
    
    def read_dict():
        # Yield one word per line from the system word list.
        with open('/usr/share/dict/words', 'r') as f:
            for line in f:
                yield line.strip()
    
    inp = input('Enter a word : ')
    
    # Keep words with distance below 3, then sort by (distance, word).
    matches = []
    for word in read_dict():
        dist = distance(inp, word)
        if dist < 3:
            matches.append((dist, word))
    print('\n'.join(map(str, sorted(matches))))
    

    Sample output -

    Enter a word : job
    (0, 'job')
    (1, 'Bob')
    (1, 'Job')
    (1, 'Rob')
    (1, 'bob')
    (1, 'cob')
    (1, 'fob')
    (1, 'gob')
    (1, 'hob')
    (1, 'jab')
    (1, 'jib')
    (1, 'jobs')
    (1, 'jog')
    (1, 'jot')
    (1, 'joy')
    (1, 'lob')
    (1, 'mob')
    (1, 'rob')
    (1, 'sob')
    ...
    
    Enter a word : checker
    (0, 'checker')
    (1, 'checked')
    (1, 'checkers')
    (2, 'Becker')
    (2, 'Decker')
    (2, 'cheaper')
    (2, 'cheater')
    (2, 'check')
    (2, "check's")
    (2, "checker's")
    (2, 'checkered')
    (2, 'checks')
    (2, 'checkup')
    (2, 'cheeked')
    (2, 'cheekier')
    (2, 'cheer')
    (2, 'chewer')
    (2, 'chewier')
    (2, 'chicer')
    (2, 'chicken')
    (2, 'chocked')
    (2, 'choker')
    (2, 'chucked')
    (2, 'cracker')
    (2, 'hacker')
    (2, 'heckler')
    (2, 'shocker')
    (2, 'thicker')
    (2, 'wrecker')
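
    Since the question is about Java, the same filter-and-sort idea can be sketched there too. This is only an illustration: the class name `ClosestWords`, the helper names, and the tiny word list are made up for the example, and the distance function is a plain textbook Levenshtein, not the asker's implementation.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustrative sketch: keep words with distance below a limit, sorted by (distance, word).
class ClosestWords {
    // Textbook two-row Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    // Returns (distance, word) pairs with distance < limit, closest first, ties by word.
    static List<Map.Entry<Integer, String>> closest(String input, List<String> words, int limit) {
        List<Map.Entry<Integer, String>> matches = new ArrayList<>();
        for (String w : words) {
            int d = levenshtein(input, w);
            if (d < limit) matches.add(new AbstractMap.SimpleEntry<>(d, w));
        }
        matches.sort((x, y) -> x.getKey().equals(y.getKey())
                ? x.getValue().compareTo(y.getValue())
                : Integer.compare(x.getKey(), y.getKey()));
        return matches;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("jot", "Bob", "job", "checker", "jobs");
        System.out.println(closest("job", words, 3));
    }
}
```

    Computing each distance once and storing it next to the word avoids recomputing it inside the comparator.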
    
  • 2021-01-31 04:00

    You can modify the Levenshtein distance by adjusting the scoring when consecutive characters match.

    Whenever consecutive characters match, the score is reduced, which makes the search more relevant.

    For example, say the factor by which we reduce the score is 10. If a word contains the substring "job", the total reduction is 10 after matching "j", 10 + 20 after matching "jo", and 10 + 20 + 30 after matching "job".

    I have written C++ code for this below:

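    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>
    
    // Sentinel marking a memo cell as "not computed yet" (no real score gets this low).
    #define INF -10000000
    #define FACTOR 10
    
    using namespace std;
    
    // memo[i][j][count]: best score from positions i, j after `count` consecutive matches.
    double memo[100][100][100];
    
    double Levenshtein(const string& inputWord, const string& checkWord, int i, int j, int count){
        if(i == inputWord.length() && j == checkWord.length()) return 0;    
        if(i == inputWord.length()) return checkWord.length() - j;
        if(j == checkWord.length()) return inputWord.length() - i;
        if(memo[i][j][count] != INF) return memo[i][j][count];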
    
        double ans1 = 0, ans2 = 0, ans3 = 0, ans = 0;
        if(inputWord[i] == checkWord[j]){
            ans1 = Levenshtein(inputWord, checkWord, i+1, j+1, count+1) - (FACTOR*(count+1));
            ans2 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
            ans3 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
            ans = min(ans1, min(ans2, ans3));
        }else{
            ans1 = Levenshtein(inputWord, checkWord, i+1, j, 0) + 1;
            ans2 = Levenshtein(inputWord, checkWord, i, j+1, 0) + 1;
            ans = min(ans1, ans2);
        }
        return memo[i][j][count] = ans;
    }
    
    int main(void) {
        // your code goes here
        string word = "job";
        string wordList[40];
        vector< pair <double, string> > ans;
        for(int i = 0;i < 40;i++){
            cin >> wordList[i];
            for(int j = 0;j < 100;j++) for(int k = 0;k < 100;k++){
                for(int m = 0;m < 100;m++) memo[j][k][m] = INF;
            }
            ans.push_back( make_pair(Levenshtein(word, wordList[i], 
                0, 0, 0), wordList[i]) );
        }
        sort(ans.begin(), ans.end());
        for(int i = 0;i < ans.size();i++){
            cout << ans[i].second << " " << ans[i].first << endl;
        }
        return 0;
    }
    

    Link to demo : http://ideone.com/4UtCX3

    Here FACTOR is taken as 10; you can experiment with other words and choose an appropriate value.

    Also note that the complexity of this modified Levenshtein distance has increased: it is now O(n^3) instead of O(n^2), because we also keep track of a counter for how many consecutive characters have matched so far.

    You can play with the score further by increasing the penalty gradually when a run of consecutive matches is followed by a mismatch, instead of the current fixed penalty of 1 that is added to the overall score.

    Also, in the above solution you can remove the strings that have a score >= 0, as they are not relevant at all. You can also choose some other threshold to get a more accurate search.
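
    Here is a Java sketch of the same scoring, for readers following along in the question's language. It mirrors the recursion of the C++ code (no substitution step, and a bonus of FACTOR times the current run length for each consecutive match), but the class and method names are hypothetical, and the memo table is sized per call instead of being a fixed global array.

```java
// Illustrative Java port of the consecutive-match scoring; names are hypothetical.
class ConsecutiveMatchScore {
    static final int FACTOR = 10;

    static double score(String input, String candidate) {
        int maxRun = Math.min(input.length(), candidate.length());
        // memo[i][j][run] stays null until that state has been computed.
        Double[][][] memo = new Double[input.length() + 1][candidate.length() + 1][maxRun + 1];
        return score(input, candidate, 0, 0, 0, memo);
    }

    private static double score(String a, String b, int i, int j, int run, Double[][][] memo) {
        if (i == a.length()) return b.length() - j; // insert the rest of b
        if (j == b.length()) return a.length() - i; // delete the rest of a
        if (memo[i][j][run] != null) return memo[i][j][run];

        // Skipping a character on either side costs 1 and resets the run.
        double best = Math.min(score(a, b, i + 1, j, 0, memo),
                               score(a, b, i, j + 1, 0, memo)) + 1;
        if (a.charAt(i) == b.charAt(j)) {
            // A match extends the run; longer runs earn a bigger reduction.
            best = Math.min(best, score(a, b, i + 1, j + 1, run + 1, memo) - FACTOR * (run + 1));
        }
        memo[i][j][run] = best;
        return best;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"job", "jab", "xyz"}) {
            double s = score("job", w);
            // Following the threshold suggested above: scores >= 0 are treated as irrelevant.
            System.out.println(w + " " + s + (s >= 0 ? " (dropped)" : ""));
        }
    }
}
```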

  • 2021-01-31 04:09

    Short of understanding the meaning of the words, which is what @DrYap suggests, the next logical unit for comparing two words (if you are not looking for misspellings) is the syllable. It is very easy to modify Levenshtein to compare syllables instead of characters; the hard part is breaking the words into syllables. There is a Java implementation, TeXHyphenator-J, which can be used to split the words. Based on this hyphenation library, here is a modified version of the Levenshtein function written by Michael Gilleland & Chas Emerick. More about syllable detection here and here. Of course, you'll probably want to avoid syllable comparison of two single-syllable words and handle that case with standard Levenshtein instead.

    import net.davidashen.text.Hyphenator;
    
    public class WordDistance {
    
        public static void main(String args[]) throws Exception {
            Hyphenator h = new Hyphenator();
            h.loadTable(WordDistance.class.getResourceAsStream("hyphen.tex"));
            // Print the syllable-level distance between the two command-line words.
            System.out.println(getSyllableLevenshteinDistance(h, args[0], args[1]));
        }
    
        /**
         * <p>
         * Calculate the syllable Levenshtein distance between two words.</p>
         * The syllable Levenshtein distance is defined as the minimal number of
         * case-insensitive syllables you have to replace, insert or delete to transform s into t.
         * @return the syllable-level edit distance
         * @throws NullPointerException if either s or t is <b>null</b>
         */
        public static int getSyllableLevenshteinDistance(Hyphenator h, String s, String t) {
            if (s == null || t == null)
                throw new NullPointerException("Strings must not be null");
    
            final String hyphen = Character.toString((char) 173);
            final String[] ss = h.hyphenate(s).split(hyphen);
            final String[] st = h.hyphenate(t).split(hyphen);
    
            final int n = ss.length;
            final int m = st.length;
    
            if (n == 0)
                return m;
            else if (m == 0)
                return n;
    
            int p[] = new int[n + 1]; // 'previous' cost array, horizontally
            int d[] = new int[n + 1]; // cost array, horizontally
    
            for (int i = 0; i <= n; i++)
                p[i] = i;
    
            for (int j = 1; j <= m; j++) {
                d[0] = j;
                for (int i = 1; i <= n; i++) {
                    int cost = ss[i - 1].equalsIgnoreCase(st[j - 1]) ? 0 : 1;
                    // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                    d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
                }
                // copy current distance counts to 'previous row' distance counts
                int[] _d = p;
                p = d;
                d = _d;
            }
    
            // our last action in the above loop was to switch d and p, so p now actually has the most recent cost counts
            return p[n];
        }
    
    }
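
    The fallback for single-syllable words mentioned above could look roughly like this. To keep the sketch runnable without the hyphenation library, the syllable splitter is passed in as a function; with TeXHyphenator-J it would wrap the hyphenate-and-split step from the code above. The class name, the method names, and the choice to keep the character-level fallback case-insensitive are assumptions for this example.

```java
import java.util.function.Function;

// Illustrative sketch: syllable-level distance with a character-level fallback.
class SyllableAwareDistance {
    // Levenshtein over arbitrary tokens (syllables or single characters), ignoring case.
    static int tokenDistance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) prev[j] = j;
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = a[i - 1].equalsIgnoreCase(b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length];
    }

    // Splits a word into one-character tokens.
    static String[] chars(String s) {
        String[] out = new String[s.length()];
        for (int i = 0; i < s.length(); i++) out[i] = String.valueOf(s.charAt(i));
        return out;
    }

    // syllabify is a caller-supplied splitter (hypothetical parameter).
    static int distance(String s, String t, Function<String, String[]> syllabify) {
        String[] ss = syllabify.apply(s);
        String[] st = syllabify.apply(t);
        if (ss.length <= 1 && st.length <= 1) {
            // Two single-syllable words: compare characters instead.
            return tokenDistance(chars(s), chars(t));
        }
        return tokenDistance(ss, st);
    }

    public static void main(String[] args) {
        // Toy splitter for the demo: the words are given already hyphenated.
        Function<String, String[]> split = w -> w.split("-");
        System.out.println(distance("job", "cat", split));
        System.out.println(distance("in-for-ma-tion", "in-for-mal", split));
    }
}
```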
    