python fuzzywuzzy's process.extract(): how does it work?

Submitted by 懵懂的女人 on 2021-02-18 10:55:50

Question


I am trying to understand how the process.extract() function in the Python module fuzzywuzzy works.

I mainly read about the fuzzywuzzy package here: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/, which is a great post explaining different scenarios that come up in fuzzy matching. It discusses several scenarios for Partial String Similarity:

1) Out Of Order
2) Token Sort
3) Token Set

And then, from this post: https://pathindependence.wordpress.com/2015/10/31/tutorial-fuzzywuzzy-string-matching-in-python-improving-merge-accuracy-across-data-products-and-naming-conventions/ I learned how to use fuzzywuzzy's process.extract() function to basically select the top k matches.

I cannot find too much info regarding how the process.extract() function works. Here's the definition/information I found on their GitHub page (https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/process.py), that this function:

Find best matches in a list or dictionary of choices, return a list of tuples containing the match and it's score. If a dictionary is used, also returns the key for each match.

However, it does not provide details regarding HOW it finds the best matches. Does it take all three scenarios I mentioned above into account?

I ask because, when I used this function, two strings that are very similar were sometimes not matched.

For example, in my current sample data set, the to-be-matched string

"Total replenishment lead time (in workdays)"

it is matched to

"PLANNING_TIME_FENCE_CODE", "BUILD_IN_WIP_FLAG"

but not to (the right answer)

"FULL_LEAD_TIME"

Even though the right answer contains "lead time", just as the to-be-matched string does, it is not matched at all. Why? And somehow the other strings, which do not look like the to-be-matched string, do get matched. Why? I am quite clueless now.


Answer 1:


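There are four ratios in fuzzywuzzy's fuzz module.

  • ratio (the simple ratio): a 0-100 similarity score for the two whole strings, based on Levenshtein-style edit operations rather than the raw distance itself.
  • partial_ratio: the score of the most similar substring; the shorter string is compared against substrings of the longer one and the best score is kept.
  • token_sort_ratio: splits each string into tokens, sorts the tokens, and then compares the rejoined strings, so word order does not matter.
  • token_set_ratio: splits each string into alphanumeric tokens, then compares the shared tokens against each string's leftover tokens, so duplicated or extra words are penalized less.

More details on the ratios can be found here: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/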
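
To make the list above concrete, here is a minimal sketch of three of the ratios built only on the standard library's difflib. It is an approximation for illustration, not fuzzywuzzy's own code: the function names mirror the library's, but the helper _tokens and the simplified preprocessing (lowercasing and splitting on whitespace only) are assumptions, and scores may differ slightly from what the real library returns. partial_ratio is left out because its substring alignment is more involved.

```python
from difflib import SequenceMatcher


def _tokens(s: str) -> list:
    """Hypothetical helper: lowercase and split on whitespace only."""
    return s.lower().split()


def simple_ratio(a: str, b: str) -> int:
    """Similarity of the two whole strings, scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Sort the tokens of each string before comparing, so order is ignored."""
    return simple_ratio(" ".join(sorted(_tokens(a))), " ".join(sorted(_tokens(b))))


def token_set_ratio(a: str, b: str) -> int:
    """Compare shared tokens against each string's leftover tokens."""
    ta, tb = set(_tokens(a)), set(_tokens(b))
    shared = " ".join(sorted(ta & tb))
    with_a = (shared + " " + " ".join(sorted(ta - tb))).strip()
    with_b = (shared + " " + " ".join(sorted(tb - ta))).strip()
    # Keep the best of the three pairwise comparisons.
    return max(
        simple_ratio(shared, with_a),
        simple_ratio(shared, with_b),
        simple_ratio(with_a, with_b),
    )


if __name__ == "__main__":
    print(simple_ratio("lead time full", "full lead time"))
    print(token_sort_ratio("lead time full", "full lead time"))
    print(token_set_ratio("lead time", "lead time lead time extra"))
```

In this sketch the simple ratio drops when the same words appear in a different order, while the token-based variants do not, which is the distinction the list is drawing.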

By default, process.extract() uses partial_ratio for comparison, but you can override it with the scorer parameter of process.extract().

Ex.

from fuzzywuzzy import fuzz, process

# Score one pair directly with partial_ratio
print(fuzz.partial_ratio('Total replenishment lead time (in workdays)', 'Lead_time_planning'))
query = 'Total replenishment lead time (in workdays)'
choices = ['PLANNING_TIME_FENCE_CODE', 'BUILD_IN_WIP_FLAG', 'Lead_time_planning']
# Rank all choices against the query with the default scorer
print(process.extract(query, choices))

The results will be:

50
[('Lead_time_planning', 50), ('PLANNING_TIME_FENCE_CODE', 38), ('BUILD_IN_WIP_FLAG', 26)]

This shows that it uses partial_ratio by default, which you can override at any time.




Answer 2:


The above answer is wrong in a key respect - the inference that the result of process.extract was the same as fuzz.partial_ratio in one case, therefore they are doing the same thing by default.

process.extract actually uses WRatio() by default, which is a weighted combination of the four fuzz ratios. This is actually a cool functionality that empirically works pretty well across fuzzy matching scenarios.

Still, you can manually specify the string comparison function via the scorer argument to extract.
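
As a rough illustration of what a pluggable scorer means, the sketch below mimics the general shape of process.extract using only the standard library: score every choice with whatever scorer is passed in, then keep the highest-scoring ones. The names extract_top and difflib_ratio are hypothetical, and the default scorer here is a plain difflib similarity, not WRatio, so the scores will not match fuzzywuzzy's.

```python
import heapq
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple


def difflib_ratio(a: str, b: str) -> int:
    """Stand-in scorer (not WRatio): whole-string similarity, 0-100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def extract_top(
    query: str,
    choices: Iterable[str],
    scorer: Callable[[str, str], int] = difflib_ratio,
    limit: int = 5,
) -> List[Tuple[str, int]]:
    """Score every choice against the query and return the best `limit`."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    # nlargest returns the pairs sorted from highest to lowest score.
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    choices = ["PLANNING_TIME_FENCE_CODE", "BUILD_IN_WIP_FLAG", "FULL_LEAD_TIME"]
    print(extract_top("full lead time", choices, limit=2))
```

With the real library, the equivalent override is a call such as process.extract(query, choices, scorer=fuzz.token_set_ratio, limit=3), which swaps the comparison function while leaving the ranking logic unchanged.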

Source for process.extract: https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/process.py



Source: https://stackoverflow.com/questions/41171665/python-fuzzywuzzys-process-extract-how-does-it-work
