Similarity Score - Levenshtein

有刺的猬 2020-12-05 10:31

I implemented the Levenshtein algorithm in Java and am now getting the corrections made by the algorithm, a.k.a. the cost. This does help a little but not much since I want

6 Answers
  • 2020-12-05 11:01

    To calculate a score, you need the maximum possible cost (insert + delete + substitute). Then use the formula below:

    score = 1 - actual_cost/max_possible_cost
    

    See this for reference - Levenshtein Score Calculation Func
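
    As a minimal sketch of the formula above, assuming unit cost per edit and taking the maximum possible cost to be the length of the longer string, the score could be computed as follows. The class and method names are hypothetical, and treating two empty strings as a perfect match is an assumption.

```java
// Sketch only: SimilarityScore, distance, and score are hypothetical names.
final class SimilarityScore {

    // Two-row dynamic-programming Levenshtein distance
    // (unit cost for insert, delete, and substitute).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int substitute = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            int[] tmp = prev; // reuse the two rows instead of allocating
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // score = 1 - actual_cost / max_possible_cost
    // Assumption: max_possible_cost is the length of the longer string.
    static double score(String a, String b) {
        int maxCost = Math.max(a.length(), b.length());
        if (maxCost == 0) {
            return 1.0; // assumption: two empty strings count as identical
        }
        return 1.0 - (double) distance(a, b) / maxCost;
    }

    public static void main(String[] args) {
        System.out.println(score("kitten", "sitting")); // distance 3, longer length 7
        System.out.println(score("abc", "abc"));        // identical strings
        System.out.println(score("abc", "xyz"));        // nothing in common
    }
}
```

    With this scaling, identical strings score 1.0 and strings that share no characters in position score 0.0.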

  • 2020-12-05 11:03

    The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. (Wikipedia)

    • So a Levenshtein distance of 0 means: both strings are equal
    • The maximum Levenshtein distance (all chars are different) is max(string1.length, string2.length)

    So if you need a percentage, you have to scale between these two points. For example:

    "Hallo", "Hello" -> Levenshtein distance 1. The maximum Levenshtein distance for these two strings is 5, so 20% of the characters do not match.

    String s1 = "Hallo";
    String s2 = "Hello";
    int lfd = calculateLevensteinDistance(s1, s2);
    // Fraction of characters that differ (0.2 here); 1 - ratio gives the similarity.
    double ratio = ((double) lfd) / (Math.max(s1.length(), s2.length()));
    
  • 2020-12-05 11:04
    // A complete example:
    
    public class demo 
    {
    public static void main(String[] args) 
    {
        String str1, str2;
    
        str1="12345";
        str2="122345";
    
    
        int re=pecentageOfTextMatch(str1, str2);
        System.out.println("Matching Percent: " + re);
    }
    
    public static int pecentageOfTextMatch(String s0, String s1) 
    {
        // Trim and collapse runs of whitespace into a single space
        s0 = s0.trim().replaceAll("\\s+", " ");
        s1 = s1.trim().replaceAll("\\s+", " ");
        // Note: the distance is scaled by the combined length of both strings,
        // so two completely different strings of equal length score 50, not 0.
        int percentage = (int) (100 - (float) LevenshteinDistance(s0, s1) * 100 / (float) (s0.length() + s1.length()));
        return percentage;
    }
    
    public static int LevenshteinDistance(String s0, String s1) {
    
        int len0 = s0.length() + 1;
        int len1 = s1.length() + 1;  
        // the array of distances
        int[] cost = new int[len0];
        int[] newcost = new int[len0];
    
        // initial cost of skipping prefix in String s0
        for (int i = 0; i < len0; i++)
            cost[i] = i;
    
        // dynamically computing the array of distances
    
        // transformation cost for each letter in s1
        for (int j = 1; j < len1; j++) {
    
            // initial cost of skipping prefix in String s1
            newcost[0] = j;
    
            // transformation cost for each letter in s0
            for (int i = 1; i < len0; i++) {
    
                // matching current letters in both strings
                int match = (s0.charAt(i - 1) == s1.charAt(j - 1)) ? 0 : 1;
    
                // computing cost for each transformation
                int cost_replace = cost[i - 1] + match;
                int cost_insert = cost[i] + 1;
                int cost_delete = newcost[i - 1] + 1;
    
                // keep minimum cost
                newcost[i] = Math.min(Math.min(cost_insert, cost_delete),
                        cost_replace);
            }
    
            // swap cost/newcost arrays
            int[] swap = cost;
            cost = newcost;
            newcost = swap;
        }
    
        // the distance is the cost for transforming all letters in both strings
        return cost[len0 - 1];
    }
    
    }
    
  • 2020-12-05 11:12

    I think this link would be useful: LevenshteinDistance

    It is available through a Maven dependency, and I think it is better to use this implementation than to write your own code.

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-text</artifactId>
        <version>1.3</version>
    </dependency>
    

    As an example, have a look at the code below:

    import org.apache.commons.text.similarity.LevenshteinDistance;
    
    public class MetricUtils {
        private static LevenshteinDistance lv = new LevenshteinDistance();
    
        public static void main(String[] args) {
            String s = "running";
            String s1 = "runninh";
            System.out.println(levensteinRatio(s, s1));
        }
    
        public static double levensteinRatio(String s, String s1) {
            return 1 - ((double) lv.apply(s, s1)) / Math.max(s.length(), s1.length());
        }
    }
    
  • 2020-12-05 11:16

    You can download Apache Commons Lang and investigate (and maybe use) the Levenshtein distance implementation in its StringUtils class.

  • 2020-12-05 11:19

    The maximum value of the Levenshtein distance between two strings is the length of the longer string. (That corresponds to substituting every character up to the length of the shorter string, plus inserts or deletes depending on whether you're going from shorter to longer or vice versa.) Given that, the similarity of the two strings is the difference between that maximum and the actual Levenshtein distance, divided by that maximum.

    Implementations of the Levenshtein algorithm tend to not record what those edits should be, but it shouldn't be that hard to calculate given the abstract algorithm on the Wikipedia page.
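
    One way to recover the edits, sketched under the assumption that keeping the full distance matrix in memory is acceptable: fill the matrix as in the textbook algorithm, then walk back from the bottom-right cell and record which operation produced each step. The class name, method name, and the wording of the output strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch only: EditScript and edits are hypothetical names.
final class EditScript {

    // Returns one shortest list of edits that turns a into b.
    // Positions are indices into the original string a.
    static List<String> edits(String a, String b) {
        int n = a.length();
        int m = b.length();

        // d[i][j] = distance between the first i chars of a and the first j chars of b
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }

        // Walk back from d[n][m] to d[0][0], recording each non-matching step.
        List<String> ops = new ArrayList<>();
        int i = n;
        int j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && a.charAt(i - 1) == b.charAt(j - 1) && d[i][j] == d[i - 1][j - 1]) {
                i--; // characters match: no edit needed
                j--;
            } else if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1] + 1) {
                ops.add("substitute '" + a.charAt(i - 1) + "' -> '" + b.charAt(j - 1) + "' at " + (i - 1));
                i--;
                j--;
            } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
                ops.add("delete '" + a.charAt(i - 1) + "' at " + (i - 1));
                i--;
            } else {
                ops.add("insert '" + b.charAt(j - 1) + "' at " + i);
                j--;
            }
        }
        Collections.reverse(ops); // the walk collects edits back to front
        return ops;
    }

    public static void main(String[] args) {
        for (String op : edits("Hallo", "Hello")) {
            System.out.println(op);
        }
    }
}
```

    The number of recorded edits equals the Levenshtein distance. When several shortest edit sequences exist, this walk returns only one of them, preferring substitution over deletion over insertion.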
