Find the similarity metric between two strings

长情又很酷 2020-11-22 13:24

How do I get the probability of a string being similar to another string in Python?

I want to get a decimal value like 0.9 (meaning 90%), etc. Preferably with the standard Python library.

11 Answers
  • 2020-11-22 14:23

    The built-in SequenceMatcher is very slow on large input; here is how it can be done with diff-match-patch:

    from diff_match_patch import diff_match_patch
    
    def compute_similarity_and_diff(text1, text2):
        dmp = diff_match_patch()
        dmp.Diff_Timeout = 0.0
        diff = dmp.diff_main(text1, text2, False)
    
        # similarity: share of characters present in both strings,
        # measured against the longer one
        common_text = sum(len(txt) for op, txt in diff if op == 0)
        text_length = max(len(text1), len(text2))
        sim = common_text / text_length if text_length else 1.0
    
        return sim, diff
    
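    A quick usage sketch (not part of the original answer); the returned similarity is the share of characters the two strings have in common, relative to the longer one:

    sim, diff = compute_similarity_and_diff("Apple", "Appel")
    print(sim)   # expected 0.8: "App" and "e" are shared, 4 of the 5 characters
    print(diff)  # list of (op, text) tuples: 0 = equal, -1 = delete, 1 = insert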
  • 2020-11-22 14:27

    There is a built-in, difflib.SequenceMatcher:

    from difflib import SequenceMatcher
    
    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()
    

    Using it:

    >>> similar("Apple","Appel")
    0.8
    >>> similar("Apple","Mango")
    0.0
    
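    When many pairs have to be compared, difflib also offers quick_ratio() and real_quick_ratio(), which are progressively cheaper upper bounds on ratio(). A common filtering pattern looks like the sketch below (threshold is just an illustrative parameter):

    from difflib import SequenceMatcher

    def similar_enough(a, b, threshold=0.8):
        # real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio(),
        # so pairs that fail them can be rejected without the expensive call.
        m = SequenceMatcher(None, a, b)
        if m.real_quick_ratio() < threshold or m.quick_ratio() < threshold:
            return False
        return m.ratio() >= threshold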
  • 2020-11-22 14:27

    You can find most of the text similarity methods and how they are calculated under this link: https://github.com/luozhouyang/python-string-similarity#python-string-similarity. Here are some examples (a usage sketch follows the list):

    • Normalized, metric, similarity and distance
    • (Normalized) similarity and distance
    • Metric distances
    • Shingles (n-gram) based similarity and distance
    • Levenshtein
    • Normalized Levenshtein
    • Weighted Levenshtein
    • Damerau-Levenshtein
    • Optimal String Alignment
    • Jaro-Winkler
    • Longest Common Subsequence
    • Metric Longest Common Subsequence
    • N-Gram
    • Shingle (n-gram) based algorithms
    • Q-Gram
    • Cosine similarity
    • Jaccard index
    • Sorensen-Dice coefficient
    • Overlap coefficient (i.e., Szymkiewicz-Simpson)
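    A usage sketch, assuming the strsimpy package (the PyPI distribution of that repository) and the NormalizedLevenshtein class shown in its README:

    from strsimpy.normalized_levenshtein import NormalizedLevenshtein

    normalized_levenshtein = NormalizedLevenshtein()
    # similarity() is 1 - normalized distance, so the result lies in [0, 1]
    print(normalized_levenshtein.similarity('Apple', 'Appel'))  # 0.6 (edit distance 2 over length 5)
    print(normalized_levenshtein.distance('Apple', 'Appel'))    # 0.4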
  • 2020-11-22 14:27

    There are many metrics to define similarity and distance between strings, as mentioned above. I will add my two cents by showing an example of Jaccard similarity with Q-grams and an example with edit distance.

    The libraries

    from nltk.metrics.distance import jaccard_distance, edit_distance
    from nltk.util import ngrams
    

    Jaccard Similarity

    1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Appel', 2)))
    

    and we get:

    0.33333333333333337
    

    And for Apple and Mango:

    1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Mango', 2)))
    

    and we get:

    0.0
    

    Edit Distance

    edit_distance('Apple', 'Appel')
    

    and we get:

    2
    

    And finally,

    edit_distance('Apple', 'Mango')
    

    and we get:

    5
    

    Cosine Similarity on Q-Grams (q=2)

    Another solution is to work with the textdistance library. I will provide an example of Cosine Similarity:

    import textdistance
    1-textdistance.Cosine(qval=2).distance('Apple', 'Appel')
    

    and we get:

    0.5
    
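    Since the question asks for a decimal between 0 and 1, the raw edit distance above can be normalized by the length of the longer string. This helper is a sketch, not an NLTK function:

    from nltk.metrics.distance import edit_distance

    def edit_similarity(a, b):
        # Divide by the longer length so 0 means completely different
        # and 1 means identical.
        if not a and not b:
            return 1.0
        return 1 - edit_distance(a, b) / max(len(a), len(b))

    print(edit_similarity('Apple', 'Appel'))  # 1 - 2/5 = 0.6
    print(edit_similarity('Apple', 'Mango'))  # 1 - 5/5 = 0.0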
  • 2020-11-22 14:29

    I think maybe you are looking for an algorithm describing the distance between strings. Here are some you may refer to (a small Hamming-based sketch follows the list):

    1. Hamming distance
    2. Levenshtein distance
    3. Damerau–Levenshtein distance
    4. Jaro–Winkler distance
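    As a minimal illustration of the first item, here is a hand-rolled Hamming-based similarity (a sketch; Hamming distance is only defined for strings of equal length):

    def hamming_similarity(a, b):
        # Count positions where the characters differ and scale to [0, 1].
        if len(a) != len(b):
            raise ValueError("Hamming distance requires equal-length strings")
        if not a:
            return 1.0
        mismatches = sum(c1 != c2 for c1, c2 in zip(a, b))
        return 1 - mismatches / len(a)

    print(hamming_similarity('Apple', 'Appel'))  # 0.6 (2 of 5 positions differ)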