I am trying to find potential matches in a pandas column full of organization names. I am currently using iterrows(), but it is extremely slow on a dataframe with ~70,000 rows. After looking through StackOverflow I tried a lambda row (apply) method, but that barely speeds things up, if at all.
The first four rows of the dataframe look like this:
index  org_name
0      cliftonlarsonallen llp minneapolis MN
1      loeb and troper llp newyork NY
2      dauby o'connor and zaleski llc carmel IN
3      wegner cpas llp madison WI
The following code block works but took around five days to process:
from fuzzywuzzy import process

org_list = df['org_name']
for index, row in df.iterrows():
    x = process.extract(row['org_name'], org_list, limit=2)[1]
    if x[1] > 93:
        df.loc[index, 'fuzzy_match'] = x[0]
        df.loc[index, 'fuzzy_match_score'] = x[1]
In effect, for each row I am comparing the organization name against the list of all organization names, taking the top two matches, then selecting the second-best match (because the top match will be the identical name), and then setting a condition that the score must be higher than 93 in order to create the new columns. The reason I'm creating additional columns is that I do not want to simply replace values -- I'd like to double-check the results first.
Is there a way to speed this up? I read several blog posts and StackOverflow questions that talked about 'vectorizing' this code, but my attempts at that failed. I also considered simply creating a 70,000 x 70,000 Levenshtein distance matrix and then extracting information from there. Is there a quicker way to generate the best match for each element in a list or pandas column?
This solution leverages apply() and should demonstrate reasonable performance improvements. Feel free to play around with the scorer and change the threshold to meet your needs:
import numpy as np
import pandas as pd
from fuzzywuzzy import process, fuzz

df = pd.DataFrame([['cliftonlarsonallen llp minneapolis MN'],
                   ['loeb and troper llp newyork NY'],
                   ["dauby o'connor and zaleski llc carmel IN"],
                   ['wegner cpas llp madison WI']],
                  columns=['org_name'])
org_list = df['org_name']
threshold = 40

def find_match(x):
    # Index [1] takes the second-best hit, because the best hit is the query itself.
    match = process.extract(x, org_list, limit=2, scorer=fuzz.partial_token_sort_ratio)[1]
    match = match if match[1] > threshold else np.nan
    return match

df['match found'] = df['org_name'].apply(find_match)
Returns:
   org_name                                  match found
0  cliftonlarsonallen llp minneapolis MN     (wegner cpas llp madison WI, 50, 3)
1  loeb and troper llp newyork NY            (wegner cpas llp madison WI, 46, 3)
2  dauby o'connor and zaleski llc carmel IN  NaN
3  wegner cpas llp madison WI                (cliftonlarsonallen llp minneapolis MN, 50, 0)
If you would just like to return the matching string itself, then you can modify as follows:
match = match[0] if match[1]>threshold else np.nan
I've added @user3483203's comment pertaining to a list comprehension here as an alternative option as well:
df['match found'] = [find_match(row) for row in df['org_name']]
Note that process.extract() is designed to handle a single query string and applies the passed scoring algorithm to that query and the supplied match options. For that reason, you have to evaluate each query against all 70,000 match options (the way your code is currently set up), which comes to len(match_options)**2 (or 4,900,000,000) string comparisons. I therefore think the best performance improvements could be achieved by limiting the potential match options via more extensive logic in the find_match() function, e.g. enforcing that the match options start with the same letter as the query.
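As a rough sketch of that first-letter idea (not a tested solution for the full dataset): the names are grouped by their first character and each name is only scored against the other names in its own group. To keep the sketch runnable without extra packages, the scorer here is difflib.SequenceMatcher from the standard library, which is a stand-in and not the fuzzywuzzy scorer used above, so the scores will differ. The names similarity() and best_matches_blocked(), and the extra "wisconsin" row, are made up for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (difflib, NOT a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_matches_blocked(names, threshold=93):
    """For each name, find the best-scoring *other* name that starts
    with the same letter. Returns (results, number_of_comparisons)."""
    # Group row positions by the (lower-cased) first character of the name.
    blocks = defaultdict(list)
    for i, name in enumerate(names):
        blocks[name[:1].lower()].append(i)

    results = [None] * len(names)
    comparisons = 0
    for members in blocks.values():
        for i in members:
            best = None
            for j in members:
                if i == j:
                    continue  # skip the row itself rather than taking the second-best hit
                comparisons += 1
                score = similarity(names[i], names[j])
                if best is None or score > best[1]:
                    best = (names[j], score)
            # Keep the candidate only if it clears the threshold.
            if best is not None and best[1] > threshold:
                results[i] = best
    return results, comparisons


# Hypothetical sample data: the second entry is an invented near-duplicate.
names = [
    "wegner cpas llp madison WI",
    "wegner cpas llp madison wisconsin",
    "loeb and troper llp newyork NY",
    "cliftonlarsonallen llp minneapolis MN",
]
matches, n_comparisons = best_matches_blocked(names, threshold=75)
print(matches)
print(n_comparisons)
```

In this toy list only the two names starting with "w" are compared with each other, so 2 comparisons are made instead of the 12 a full all-against-all pass over four names would need. The trade-off is that a pair whose first letters differ (for example because of a typo in the first character) will never be compared, so the blocking rule should be chosen with the data in mind.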
Source: https://stackoverflow.com/questions/52631291/vectorizing-or-speeding-up-fuzzywuzzy-string-matching-on-pandas-column