Is there any way to speed up fuzzy string matching with fuzzywuzzy in pandas?
I have a dataframe, extra_names, containing names that I want to fuzzy-match against the names in another dataframe, names_df.
>> extra_names.head()
       not_matching
0         Vij Sales
1  Crom Electronics
2       REL Digital
3        Bajaj Elec
4     Reliance Digi
>> len(extra_names)
6500
>> names_df.head()
               names  types
0        Vijay Sales      1
1  Croma Electronics      1
2   Reliance Digital      2
3  Bajaj Electronics      2
4    Pai Electricals      2
>> len(names_df)
250
As of now, I'm running the logic using the following code, but it's taking forever to complete.
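from fuzzywuzzy import process

choices = names_df['names'].unique().tolist()

def fuzzy_match(row):
    # extractOne returns a (match, score) tuple, or None if nothing is found
    best_match = process.extractOne(row, choices)
    return (best_match[0], best_match[1]) if best_match else ('', '')

%%timeit
# apply returns a Series of tuples, so unzip it into the two columns
extra_names['best_match'], extra_names['match%'] = zip(*extra_names['not_matching'].apply(fuzzy_match))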
As I'm posting this question, the query is still running. Is there any way to speed up this fuzzy string matching process?
Let's try difflib:
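import difflib
from functools import partial

# Bind the candidate list once; n=1 keeps only the closest match per name
f = partial(
    difflib.get_close_matches, possibilities=names_df['names'].tolist(), n=1)

# get_close_matches returns a list, so take its first element ('' if empty)
matches = extra_names['not_matching'].map(f).str[0].fillna('')
# Similarity ratio between each match and the name it was found for
scores = [
    difflib.SequenceMatcher(None, x, y).ratio()
    for x, y in zip(matches, extra_names['not_matching'])
]
extra_names.assign(best=matches, score=scores)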
       not_matching               best     score
0         Vij Sales        Vijay Sales  0.900000
1  Crom Electronics  Croma Electronics  0.969697
2       REL Digital   Reliance Digital  0.666667
3        Bajaj Elec  Bajaj Electronics  0.740741
4     Reliance Digi   Reliance Digital  0.896552
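If the same name can appear more than once in extra_names (an assumption, the question does not say), it may also help to match each distinct name only once. Below is a minimal stdlib-only sketch of the same difflib idea on plain lists. The helper names best_match and match_all are made up for illustration, and the sample lists are just the head() rows from the question.

```python
import difflib

def best_match(name, choices, cutoff=0.6):
    """Return (closest choice, ratio), or ('', 0.0) if nothing clears the cutoff."""
    # cutoff=0.6 is difflib's own default for get_close_matches
    found = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    if not found:
        return '', 0.0
    match = found[0]
    # Same argument order as above: the match first, then the query
    return match, difflib.SequenceMatcher(None, match, name).ratio()

def match_all(names, choices, cutoff=0.6):
    """Match every name, computing each distinct name only once."""
    cache = {}
    results = []
    for name in names:
        if name not in cache:
            cache[name] = best_match(name, choices, cutoff)
        results.append(cache[name])
    return results

# Sample values copied from the head() output in the question
choices = ['Vijay Sales', 'Croma Electronics', 'Reliance Digital',
           'Bajaj Electronics', 'Pai Electricals']
queries = ['Vij Sales', 'Crom Electronics', 'REL Digital',
           'Bajaj Elec', 'Reliance Digi', 'Vij Sales']

results = match_all(queries, choices)
for query, (match, score) in zip(queries, results):
    print(f'{query!r} -> {match!r} ({score:.6f})')
```

To put the results back on the frame, something like extra_names['best'], extra_names['score'] = zip(*match_all(extra_names['not_matching'], names_df['names'].tolist())) should work. The cache only pays off when names repeat; with all-unique names this does the same amount of work as the map above.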
Source: https://stackoverflow.com/questions/56521625/quicker-way-to-perform-fuzzy-string-match-in-pandas