What are some algorithms for comparing how similar two strings are?


I need to compare strings to decide whether they represent the same thing. This relates to case titles entered by humans where abbreviations and other small details may di

5 Answers

星月不相逢

2020-11-30 18:23
What you're looking for are called string metric algorithms. There are a significant number of them, many with similar characteristics. Among the more popular:
- Levenshtein Distance : The minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. The strings do not have to be the same length.
- Hamming Distance : The number of positions at which two equal-length strings differ.
- Smith–Waterman : A local sequence alignment algorithm that finds the most similar sub-sequences of two strings.
- Sørensen–Dice Coefficient : A similarity measure, commonly computed over the adjacent character pairs (bigrams) of the two strings.
Have a look at these as well as others on the wiki page on the topic.
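
As a rough illustration of the first two metrics in the list, here is a minimal Python sketch. The function names are my own, and `similarity` is a hypothetical helper that scales the Levenshtein distance into a 0..1 score; it is one possible normalization, not a standard definition.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: scale the distance to 0..1, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

For human-entered titles, lowercasing and trimming whitespace before comparing will usually matter as much as the choice of metric.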
北海茫月

2020-11-30 18:30

One option is to compute the length of the longest common subsequence (LCS) of the two strings. If the LCS is shorter than either string, the strings are not identical; the closer its length is to the lengths of the strings, the more similar they are.

The LCS length can be computed with dynamic programming. If you only need the length and not the subsequence itself, you can also reduce the space complexity by keeping just the previous row of the table.
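
A minimal Python sketch of that idea, assuming only the length is needed. It keeps two rows of the dynamic programming table instead of the full matrix. `lcs_similarity` is a hypothetical scoring helper that I added for illustration; dividing by the longer string's length is just one possible choice.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, keeping only two DP rows."""
    if len(b) > len(a):
        a, b = b, a  # make b the shorter string so the rows stay small
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common subsequence found for both shorter prefixes.
                current.append(previous[j - 1] + 1)
            else:
                # Otherwise carry over the best result seen so far.
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Hypothetical score: LCS length divided by the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest
```

This uses space proportional to the shorter string, while the running time stays proportional to the product of the two lengths.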

没有蜡笔的小新

2020-11-30 18:31

Damerau–Levenshtein distance is another algorithm for comparing two strings. It is similar to Levenshtein distance, but it also counts a transposition of two adjacent characters as a single edit, so it may give better results for error correction.

For example, the Levenshtein distance between "night" and "nigth" is 2, but the Damerau–Levenshtein distance is 1, because the only difference is a swap of one pair of adjacent characters.
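
Here is a minimal Python sketch of the transposition rule. Note that it implements the restricted variant, usually called optimal string alignment, which is simpler than full Damerau–Levenshtein and never edits the same substring twice. The function name is my own.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # A swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("night", "nigth"))  # 1
```

The two variants agree on simple typos like this one. They can differ when a transposed pair also needs another edit between its characters.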


时光说笑

2020-11-30 18:41

Another algorithm that you can consider is the Simon White Similarity:

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    if string is None:
        return []

    s = string.lower()
    return [s[i : i + 2] for i in range(len(s) - 1)]

def simon_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    # Total number of bigrams in both strings (the denominator of the score).
    total = len(pairs1) + len(pairs2)

    if total == 0:
        return 0.0

    hit_count = 0
    for x in pairs1:
        if x in pairs2:
            hit_count += 1
            # Consume the matched bigram so it cannot be counted twice;
            # otherwise repeated bigrams can push the score above 1.
            pairs2.remove(x)
    return (2.0 * hit_count) / total

小蘑菇

2020-11-30 18:45

You could use n-grams for that. For example, transform the two strings into word trigrams (usually lowercased) and compare the percentage of them that are equal to one another.

Your challenge is to define a minimum percentage for similarity.

http://en.wikipedia.org/wiki/N-gram
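
A minimal Python sketch of this approach. It scores the overlap with the Jaccard index (shared trigrams divided by all distinct trigrams), which is my assumption for "percentage that are equal". The function names, the example titles, and the `THRESHOLD` value are hypothetical.

```python
def word_ngrams(text: str, n: int = 3) -> set:
    """Set of lowercase word n-grams; texts shorter than n words yield one gram."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard index: shared n-grams divided by all distinct n-grams."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


THRESHOLD = 0.5  # hypothetical cut-off; tune it against real titles

# Made-up titles for illustration: 4 shared trigrams out of 5 distinct ones.
score = ngram_similarity("State of Ohio v. John Smith",
                         "The State of Ohio v. John Smith")
print(score >= THRESHOLD)  # True
```

Word trigrams are strict for short titles, because a single changed word affects up to three trigrams. Character n-grams or a smaller `n` are more forgiving of abbreviations.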
