Better fuzzy matching performance?


Question


I'm currently using the get_close_matches method from difflib to iterate through a list of 15,000 strings and find the closest match against another list of approximately 15,000 strings:

import difflib

a = ['blah', 'pie', 'apple', ...]   # ~15,000 strings
b = ['jimbo', 'zomg', 'pie', ...]   # ~15,000 strings

for value in a:
    # n=1 keeps only the single best match; cutoff=.85 discards weak matches
    difflib.get_close_matches(value, b, n=1, cutoff=.85)

It takes .58 seconds per value, which means the loop will take about 8,714 seconds, or roughly 145 minutes, to finish. Is there another library/method that might be faster, or a way to improve the speed of this method? I've already tried converting both lists to lower case, but it only resulted in a slight speed increase.


Answer 1:


fuzzyset indexes strings by their bigrams and trigrams, so it finds approximate matches in O(log(N)) vs O(N) for difflib. For my fuzzyset of 1M+ words and word-pairs, it can compute the index in about 20 seconds and find the closest match in less than 100 ms.
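Below is a minimal sketch of how fuzzyset could replace the difflib loop, assuming the a and b lists from the question; the 0.85 threshold is carried over from the question, and the variable names (fs, matches) are illustrative:

import fuzzyset

# Build the index over b once; lookups afterwards only touch strings
# that share n-grams with the query.
fs = fuzzyset.FuzzySet()
for s in b:
    fs.add(s)

matches = {}
for value in a:
    result = fs.get(value)              # list of (score, string) tuples, or None
    if result and result[0][0] >= 0.85:
        matches[value] = result[0][1]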




Answer 2:


Perhaps you can build an index of the trigrams (three consecutive letters) that appear in each list. Only check strings in a against strings in b that share a trigram.
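As a rough illustration of that idea, here is a sketch of a trigram pre-filter in front of difflib, reusing the a and b lists from the question; the helper names (trigrams, index, matches) are made up for the example:

from collections import defaultdict
import difflib

def trigrams(s):
    # all 3-character substrings of s
    return {s[i:i + 3] for i in range(len(s) - 2)}

# Index every string in b by the trigrams it contains.
index = defaultdict(set)
for s in b:
    for tri in trigrams(s):
        index[tri].add(s)

matches = {}
for value in a:
    # Gather only the strings in b that share at least one trigram with value.
    candidates = set()
    for tri in trigrams(value):
        candidates |= index[tri]
    best = difflib.get_close_matches(value, list(candidates), n=1, cutoff=.85)
    if best:
        matches[value] = best[0]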

You might want to look at the BLAST bioinformatics tool; it does approximate sequence alignments against a sequence database.




Answer 3:


Try this

https://code.google.com/p/pylevenshtein/

The Levenshtein Python C extension module contains functions for fast computation of:

- Levenshtein (edit) distance, and edit operations
- string similarity
- approximate median strings, and generally string averaging
- string sequence and set similarity

It supports both normal and Unicode strings.
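For example, here is a sketch of a brute-force loop using python-Levenshtein's ratio(), a normalized similarity in [0, 1] roughly comparable to difflib's cutoff. It is still an O(len(a) * len(b)) scan over the lists from the question, but each comparison runs in C and is much cheaper than difflib's SequenceMatcher; the matches name is illustrative:

import Levenshtein

matches = {}
for value in a:
    best_score, best_match = 0.0, None
    for candidate in b:
        score = Levenshtein.ratio(value, candidate)
        if score > best_score:
            best_score, best_match = score, candidate
    if best_score >= 0.85:
        matches[value] = best_match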




Answer 4:


I tried a few methods for fuzzy matching. The best one was cosine similarity, with the threshold set to whatever your need is (I kept an 80% fuzzy-match threshold).
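The answer does not spell out the featurization, so here is one common way to set it up: TF-IDF over character trigrams with scikit-learn, then a cosine-similarity matrix between the two lists from the question (for very large lists you would compute the similarities in chunks rather than as one dense matrix). The vectorizer, similarity, and matches names are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
vectorizer.fit(a + b)                    # learn the trigram vocabulary from both lists
tfidf_a = vectorizer.transform(a)
tfidf_b = vectorizer.transform(b)

# similarity[i, j] is the cosine similarity between a[i] and b[j]
similarity = cosine_similarity(tfidf_a, tfidf_b)

matches = {}
for i, value in enumerate(a):
    j = similarity[i].argmax()
    if similarity[i, j] >= 0.80:         # the ~80% threshold mentioned above
        matches[value] = b[j]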



Source: https://stackoverflow.com/questions/21408760/better-fuzzy-matching-performance
