How do I get the probability of a string being similar to another string in Python?
I want to get a decimal value like 0.9 (meaning 90%) etc. Preferably with standar
The builtin SequenceMatcher
is very slow on large input, here's how it can be done with diff-match-patch:
from diff_match_patch import diff_match_patch
def compute_similarity_and_diff(text1, text2):
dmp = diff_match_patch()
dmp.Diff_Timeout = 0.0
diff = dmp.diff_main(text1, text2, False)
# similarity
common_text = sum([len(txt) for op, txt in diff if op == 0])
text_length = max(len(text1), len(text2))
sim = common_text / text_length
return sim, diff
There is a built in.
from difflib import SequenceMatcher
def similar(a, b):
return SequenceMatcher(None, a, b).ratio()
Using it:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
You can find most of the text similarity methods and how they are calculated under this link: https://github.com/luozhouyang/python-string-similarity#python-string-similarity Here some examples;
Normalized, metric, similarity and distance
(Normalized) similarity and distance
Metric distances
There are many metrics to define similarity and distance between strings as mentioned above. I will give my 5 cents by showing an example of Jaccard similarity
with Q-Grams
and an example with edit distance
.
The libraries
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from nltk.metrics.distance import edit_distance
Jaccard Similarity
1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Appel', 2)))
and we get:
0.33333333333333337
And for the Apple
and Mango
1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Mango', 2)))
and we get:
0.0
Edit Distance
edit_distance('Apple', 'Appel')
and we get:
2
And finally,
edit_distance('Apple', 'Mango')
and we get:
5
Cosine Similarity on Q-Grams (q=2)
Another solution is to work with the textdistance
library. I will provide an example of Cosine Similarity
import textdistance
1-textdistance.Cosine(qval=2).distance('Apple', 'Appel')
and we get:
0.5
I think maybe you are looking for an algorithm describing the distance between strings. Here are some you may refer to: