Find the similarity metric between two strings

长情又很酷

How do I get the probability of a string being similar to another string in Python?

I want to get a decimal value like 0.9 (meaning 90%) etc. Preferably with standar

11 answers

长发绾君心

2020-11-22 14:02
FuzzyWuzzy is a package that implements Levenshtein distance in Python, with helper functions for situations where you may want two distinct strings to be considered identical. For example:
```
>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100
```
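FuzzyWuzzy reports scores on a 0 to 100 scale, so dividing by 100 gives the decimal value the question asks for. The idea behind the token-sort comparison can be sketched with the standard library alone. This is only an approximation under the assumption that sorting lowercased, whitespace-separated tokens is the essential step; the helper names `simple_ratio` and `token_sort_similarity` are made up for illustration and are not FuzzyWuzzy's API.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] computed by the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def token_sort_similarity(a: str, b: str) -> float:
    """Compare two strings after lowercasing and sorting their tokens.

    Hypothetical helper: it ignores word order, nothing more.
    """
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return simple_ratio(norm_a, norm_b)


plain = simple_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
sorted_tokens = token_sort_similarity("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
print(plain, sorted_tokens)
```

Because both strings contain the same words, the token-sorted comparison treats them as identical, while the plain comparison does not.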
遥遥无期

2020-11-22 14:02
The distance package includes Levenshtein distance:
```
import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3
```
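If installing a package is not an option, the distance above can be computed with a short dynamic-programming routine and then turned into a value between 0 and 1. The sketch below is a plain textbook version; dividing by the length of the longer string is one common normalisation and is an assumption here, not something the `distance` package prescribes.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("lenvestein", "levenshtein"))
print(levenshtein_similarity("lenvestein", "levenshtein"))
```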
盖世英雄少女心

2020-11-22 14:06
Solution #1: Python builtin

Use SequenceMatcher from difflib.

Pros: part of the Python standard library, so no extra package is needed.
Cons: limited; there are many other good string-similarity algorithms out there.
Example:
```
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
```
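To see where the 0.75 comes from: the difflib documentation defines the ratio as 2.0*M / T, where M is the number of matched characters and T is the combined length of both strings. A small sketch that recomputes it by hand from the matching blocks (the variable names are only illustrative):

```python
from difflib import SequenceMatcher

a, b = "abcd", "bcde"
matcher = SequenceMatcher(None, a, b)

# Sum the sizes of all matching blocks; the final block is a size-0 sentinel.
matched = sum(block.size for block in matcher.get_matching_blocks())

# ratio = 2 * M / T, with T the total number of characters in both strings.
manual_ratio = 2.0 * matched / (len(a) + len(b))
print(matched, manual_ratio, matcher.ratio())
```

Here the only matching block is "bcd", so M is 3 and T is 8.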
Solution #2: jellyfish library

It's a very good library with good coverage and few issues. It supports:
- Levenshtein Distance
- Damerau-Levenshtein Distance
- Jaro Distance
- Jaro-Winkler Distance
- Match Rating Approach Comparison
- Hamming Distance

Pros: easy to use, wide gamut of supported algorithms, well tested.
Cons: not part of the standard library.

Example:
```
>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> # note: newer jellyfish releases name this function jaro_similarity
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1
```
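Of the measures listed, Jaro is the one that already returns a value between 0 and 1, which is what the question asks for. The following is a minimal pure-Python sketch of the usual textbook definition, written to show how the number above arises; it is not jellyfish's implementation, and the function name `jaro_similarity` is just chosen for this example.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only count as matching if they are at most this far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that line up with a different character.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if not matched1[i]:
            continue
        while not matched2[k]:
            k += 1
        if s1[i] != s2[k]:
            out_of_order += 1
        k += 1
    transpositions = out_of_order // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


print(jaro_similarity("jellyfish", "smellyfish"))
```

For 'jellyfish' and 'smellyfish' there are 8 matching characters and no transpositions, giving (8/9 + 8/10 + 1) / 3.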

执笔经年

2020-11-22 14:07

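Note that difflib.SequenceMatcher works by repeatedly finding the longest contiguous matching block, and on long, repetitive inputs its default autojunk heuristic ignores frequently occurring characters. The result is often not what is desired, for example: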

```
>>> from difflib import SequenceMatcher
>>> a1 = "Apple"
>>> a2 = "Appel"
>>> a1 *= 50
>>> a2 *= 50
>>> SequenceMatcher(None, a1, a2).ratio()
0.012  # very low
>>> SequenceMatcher(None, a1, a2).get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=250, b=250, size=0)]  # only the first block is recorded
```
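As far as I can tell from the difflib documentation, the very low score here comes from the `autojunk` heuristic: when the second sequence has at least 200 items, items that make up more than 1% of it are treated as junk. A quick sketch comparing the default behaviour with `autojunk=False` on the same inputs (the variable names are illustrative):

```python
from difflib import SequenceMatcher

a1 = "Apple" * 50
a2 = "Appel" * 50

# Default: autojunk=True, so the frequent characters in a2 are ignored as junk.
with_heuristic = SequenceMatcher(None, a1, a2).ratio()

# Disabling the heuristic lets every character take part in the matching.
without_heuristic = SequenceMatcher(None, a1, a2, autojunk=False).ratio()

print(with_heuristic, without_heuristic)
```

With the heuristic disabled the ratio is much higher, although SequenceMatcher still aligns contiguous blocks only, so a proper alignment algorithm can be the better fit for this kind of input.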

Finding the similarity between two strings is closely related to pairwise sequence alignment in bioinformatics. There are many dedicated libraries for this, including Biopython. This example uses the Needleman-Wunsch algorithm:

```
>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.score(a1, a2)
200.0
>>> aligner.algorithm
'Needleman-Wunsch'
```
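For readers without Biopython, the global alignment score can be reproduced with a short dynamic-programming sketch. It assumes a score of 1 per match and 0 for mismatches and gaps, which I believe corresponds to the PairwiseAligner defaults used above; `needleman_wunsch_score` is a hypothetical helper name, and the function returns only the score, not the alignment itself.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: float = 1.0,
                           mismatch: float = 0.0,
                           gap: float = 0.0) -> float:
    """Best global alignment score of a and b with a linear gap penalty."""
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [i * gap]
        for j, cb in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ca == cb else mismatch)
            up = previous[j] + gap        # ca aligned against a gap
            left = current[j - 1] + gap   # cb aligned against a gap
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


a1 = "Apple" * 50
a2 = "Appel" * 50
score = needleman_wunsch_score(a1, a2)
print(score, score / max(len(a1), len(a2)))
```

Dividing the score by the length of the longer string is one possible way to turn it into a 0 to 1 similarity under this particular scoring scheme.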

Using Biopython or another bioinformatics package is more flexible than any part of the Python standard library, since many different scoring schemes and algorithms are available. You can also retrieve the aligned sequences to visualise what is happening:

>>> alignment = next(aligner.align(a1, a2))
>>> alignment.score
200.0
>>> print(alignment)
Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-
|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-
App-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-el

猫巷女王i

2020-11-22 14:13
TextDistance:

TextDistance is a Python library for comparing the distance between two or more sequences using many algorithms. It offers:
- 30+ algorithms
- Pure Python implementation
- Simple usage
- Comparison of more than two sequences
- More than one implementation in one class for some algorithms
- Optional numpy usage for maximum speed
Example 1:
```
import textdistance
textdistance.hamming('test', 'text')
```
Output:

1

Example 2:
```
import textdistance

textdistance.hamming.normalized_similarity('test', 'text')
```
Output:

0.75
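The two results above are easy to check by hand: 'test' and 'text' differ in one of four positions, so the distance is 1 and the normalised similarity is 1 - 1/4. A minimal sketch of that calculation follows. It assumes the classic definition of Hamming distance for equal-length strings and raises an error otherwise, which may differ from how TextDistance treats strings of unequal length; the function names are illustrative.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("classic Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def hamming_similarity(a: str, b: str) -> float:
    """Fraction of positions that agree, in [0, 1]."""
    distance = hamming_distance(a, b)
    if len(a) == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - distance / len(a)


print(hamming_distance("test", "text"))
print(hamming_similarity("test", "text"))
```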

Thanks and Cheers!!!

借酒劲吻你

2020-11-22 14:16

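You can create a function like this, which pads the shorter string with spaces and returns the fraction of positions at which the characters match:

```
def similar(w1, w2):
    # Pad the shorter string with spaces so both have the same length.
    w1 = w1 + ' ' * (len(w2) - len(w1))
    w2 = w2 + ' ' * (len(w1) - len(w2))
    if not w1:
        return 1.0  # two empty strings are treated as identical
    # Fraction of positions at which the characters are equal.
    return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))
```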
