I want to use something like difflib.get_close_matches
but instead of the most similar strings, I would like to obtain their indexes (i.e. positions in the list).
Indexes are more flexible, because one can relate an index to other data structures associated with the matched string.
For example, instead of:
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'Hallo', 'hallo']
I would like:
>>> difflib.get_close_matches('Hello', words)
[0, 1, 6]
There doesn't seem to be a parameter to obtain this result. Is there an alternative to difflib.get_close_matches()
that returns the indexes?
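For what it's worth, a naive workaround (just a sketch, not something difflib offers) is to map the matched strings back to their positions with list.index, but that breaks down as soon as the list contains duplicate strings, which is part of why I want the indexes directly:
>>> matches = difflib.get_close_matches('Hello', words)
>>> [words.index(m) for m in matches]
[0, 1, 6]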
My research towards an alternative
I know I could use difflib.SequenceMatcher and then compare the strings one-to-one with ratio (or quick_ratio). However, I am afraid this would be very inefficient, because:
1. I would have to create thousands of SequenceMatcher objects and compare them (I was expecting that get_close_matches avoids the class). EDIT: False. I checked the source code of get_close_matches; it actually uses SequenceMatcher.
2. There is no cutoff (I was guessing there is an optimization that avoids calculating the ratio for all the strings). EDIT: Partially false. get_close_matches does not have any major optimizations, except that it chains real_quick_ratio, quick_ratio and ratio (see the short check of these three methods after the timings below). In any case, I can easily copy that optimization into my own function. I also hadn't considered that SequenceMatcher has methods to set the sequences, set_seq1 and set_seq2, so at least I won't have to create a new object each time.
3. As far as I understand, the Python libraries are compiled C, and this would increase performance. EDIT: I am quite sure this is the case; the function is in the folder called cpython. EDIT: There is only a small difference (p-value is 0.030198) between calling it directly from difflib and copying the function into a file mydifflib.py:
ipdb> timeit.repeat("gcm('hello', _vals)", setup="from difflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10)
[13.230449825001415, 13.126462900007027, 12.965455356999882, 12.955717618009658, 13.066136312991148, 12.935014379996574, 13.082025538009475, 12.943519036009093, 13.149949093989562, 12.970130036002956]
ipdb> timeit.repeat("gcm('hello', _vals)", setup="from mydifflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10)
[13.363269686000422, 13.087718107010005, 13.112324478992377, 13.358293497993145, 13.283965317998081, 13.056695280989516, 13.021098569995956, 13.04310674899898, 13.024205000008806, 13.152750282009947]
Nevertheless, it is not nearly as bad as I expected, so I think I will proceed unless anybody knows of another library or alternative.
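For reference, here is a quick interactive check of how the three methods relate (the word pairs below are only illustrative): real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio(), so chaining them against the cutoff rejects most non-matches before the expensive ratio() call is made.
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, 'screen', 'hello')
>>> s.real_quick_ratio(), s.quick_ratio(), s.ratio()
(0.9090909090909091, 0.18181818181818182, 0.18181818181818182)
>>> s.set_seq1('hallo')
>>> s.real_quick_ratio(), s.quick_ratio(), s.ratio()
(1.0, 0.8, 0.8)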
I took the source code of get_close_matches and modified it to return the indexes instead of the string values.
# mydifflib.py
from difflib import SequenceMatcher
from heapq import nlargest as _nlargest
def get_close_matches_indexes(word, possibilities, n=3, cutoff=0.6):
    """Use SequenceMatcher to return a list of the indexes of the best
    "good enough" matches. word is a sequence for which close matches
    are desired (typically a string).

    possibilities is a list of sequences against which to match word
    (typically a list of strings).

    Optional arg n (default 3) is the maximum number of close matches to
    return. n must be > 0.

    Optional arg cutoff (default 0.6) is a float in [0, 1]. Possibilities
    that don't score at least that similar to word are ignored.
    """
    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for idx, x in enumerate(possibilities):
        s.set_seq1(x)
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), idx))

    # Move the best scorers to head of list
    result = _nlargest(n, result)

    # Strip scores for the best n matches
    return [x for score, x in result]
Usage
>>> from mydifflib import get_close_matches_indexes
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> get_close_matches_indexes('hello', words)
[0, 6, 1]
Now I can relate these indexes to the data associated with each string, without having to search for the strings again.
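For instance (the extra list below is made up purely to illustrate the point), an index returned by the function can be used directly against any data structure kept in parallel with words:
>>> word_ids = [10, 11, 12, 13, 14, 15, 16, 17, 18]   # hypothetical ids, parallel to words
>>> [idx] = get_close_matches_indexes('hello', words, n=1)
>>> idx, words[idx], word_ids[idx]
(0, 'hello', 10)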
Source: https://stackoverflow.com/questions/50861237/is-there-an-alternative-to-difflib-get-close-matches-that-returns-indexes-l