Find the similarity metric between two strings

长情又很酷

How do I get the probability of a string being similar to another string in Python?

I want to get a decimal value like 0.9 (meaning 90%) etc. Preferably with standar

11 Answers

猫巷女王i

2020-11-22 14:23

The built-in SequenceMatcher is very slow on large inputs. Here is how the same measure can be computed with diff-match-patch:

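from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0  # 0 disables the timeout, so the diff runs to completion
    diff = dmp.diff_main(text1, text2, False)

    # similarity: characters in unchanged segments (op == 0) over the longer length
    common_text = sum(len(txt) for op, txt in diff if op == 0)
    text_length = max(len(text1), len(text2))
    # two empty strings are treated as identical to avoid dividing by zero
    sim = common_text / text_length if text_length else 1.0

    return sim, diff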

闹比i

2020-11-22 14:27

There is a built-in for this: SequenceMatcher, in the standard-library difflib module.

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Using it:

>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
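
The 0.0 for "Apple" and "Mango" comes from the comparison being case-sensitive: the uppercase "A" does not match the lowercase "a". If case and surrounding whitespace should not count, one option is to normalize both strings first. The sketch below assumes that is what you want; `similar_normalized` is a hypothetical helper name, not part of difflib.

```python
from difflib import SequenceMatcher

def similar_normalized(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] after trimming whitespace and case folding."""
    a_norm = a.strip().casefold()
    b_norm = b.strip().casefold()
    # ratio() returns 2*M/T, where M is the number of matched characters
    # and T is the combined length of both sequences.
    return SequenceMatcher(None, a_norm, b_norm).ratio()

print(similar_normalized("Apple", "APPLE "))  # 1.0
print(similar_normalized("Apple", "Mango"))   # 0.2 (the shared "a" now matches)
```

Whether to normalize depends on the data: for identifiers or passwords, case differences are usually meaningful and the plain version above is the safer choice.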

走了就别回头了

2020-11-22 14:27
You can find most text similarity methods, and how they are calculated, at this link: https://github.com/luozhouyang/python-string-similarity#python-string-similarity Here are some examples:
- Normalized, metric, similarity and distance
- (Normalized) similarity and distance
- Metric distances
- Shingles (n-gram) based similarity and distance
- Levenshtein
- Normalized Levenshtein
- Weighted Levenshtein
- Damerau-Levenshtein
- Optimal String Alignment
- Jaro-Winkler
- Longest Common Subsequence
- Metric Longest Common Subsequence
- N-Gram
- Shingle (n-gram) based algorithms
- Q-Gram
- Cosine similarity
- Jaccard index
- Sorensen-Dice coefficient
- Overlap coefficient (i.e., Szymkiewicz-Simpson)
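
To make two of the listed entries concrete, here is a minimal pure-Python sketch of Levenshtein distance and one common way to normalize it into a similarity between 0 and 1. This is not the linked library's code: the function names are hypothetical, and dividing by the longer string's length is an assumption about the normalization (other definitions exist).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the rows short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]

def normalized_levenshtein_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 when every position has to change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein(a, b) / longest

print(levenshtein("Apple", "Appel"))                        # 2
print(normalized_levenshtein_similarity("Apple", "Appel"))  # 0.6
```

Only two rows of the dynamic-programming table are kept at a time, so memory grows with the shorter string rather than with the product of both lengths.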
悲&欢浪女

2020-11-22 14:27
There are many metrics for similarity and distance between strings, as mentioned above. I will add my two cents by showing an example of Jaccard similarity on q-grams and an example of edit distance.

The libraries
```
from nltk.metrics.distance import edit_distance, jaccard_distance
from nltk.util import ngrams
```
Jaccard Similarity
```
1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Appel', 2)))
```
and we get:
```
0.33333333333333337
```
And for Apple and Mango:
```
1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Mango', 2)))
```
and we get:
```
0.0
```
Edit Distance
```
edit_distance('Apple', 'Appel')
```
and we get:
```
2
```
And finally,
```
edit_distance('Apple', 'Mango')
```
and we get:
```
5
```
Cosine Similarity on Q-Grams (q=2)

Another option is the textdistance library. Here is an example of cosine similarity:
```
import textdistance
1-textdistance.Cosine(qval=2).distance('Apple', 'Appel')
```
and we get:
```
0.5
```
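If installing nltk or textdistance is not an option, the q-gram figures above can be reproduced with the standard library alone. The sketch below is only an illustration with hypothetical function names. Its cosine is the textbook vector form over q-gram counts, which is not necessarily what textdistance computes internally, although the two agree on this example.

```python
from collections import Counter
from math import sqrt

def qgrams(text, q=2):
    """All overlapping substrings of length q, e.g. 'Apple' -> Ap, pp, pl, le."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]

def jaccard_similarity(a, b, q=2):
    """Shared distinct q-grams divided by all distinct q-grams."""
    set_a, set_b = set(qgrams(a, q)), set(qgrams(b, q))
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two inputs without any q-grams count as identical
    return len(set_a & set_b) / len(union)

def cosine_similarity(a, b, q=2):
    """Cosine of the angle between the two q-gram count vectors."""
    counts_a, counts_b = Counter(qgrams(a, q)), Counter(qgrams(b, q))
    dot = sum(counts_a[g] * counts_b[g] for g in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one input is too short to have a q-gram
    return dot / (norm_a * norm_b)

print(jaccard_similarity("Apple", "Appel"))  # 2 shared of 6 distinct bigrams
print(cosine_similarity("Apple", "Appel"))   # 0.5
```

"Apple" and "Appel" share the bigrams "Ap" and "pp" out of six distinct ones, which is where the one-third Jaccard value comes from.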
灰色年华

2020-11-22 14:29
You may be looking for an algorithm that measures the distance between strings. Here are some to consider:
1. Hamming distance
2. Levenshtein distance
3. Damerau–Levenshtein distance
4. Jaro–Winkler distance
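
Of the four, Hamming distance is the simplest: it counts the positions at which two equal-length strings differ. A minimal sketch follows, with hypothetical function names; turning the distance into a 0 to 1 score by dividing by the length is an assumption made here to match the decimal value the question asks for.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))

def hamming_similarity(a: str, b: str) -> float:
    """Fraction of positions that match: 1.0 means identical."""
    if not a:
        return 1.0  # both strings are empty (lengths were checked above)
    return 1.0 - hamming_distance(a, b) / len(a)

print(hamming_distance("Apple", "Appel"))    # 2
print(hamming_similarity("Apple", "Appel"))  # 0.6
```

Because it compares position by position, Hamming distance cannot handle insertions or deletions. For strings of different lengths, Levenshtein or one of the other listed distances is the better fit.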