Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column

Posted by 烂漫一生 on 2019-12-01 00:35:59

This solution uses apply() and should give a reasonable performance improvement. Adjust the scorer and the threshold to suit your data:

import pandas as pd, numpy as np
from fuzzywuzzy import process, fuzz

df = pd.DataFrame([['cliftonlarsonallen llp minneapolis MN'],
        ['loeb and troper llp newyork NY'],
        ["dauby o'connor and zaleski llc carmel IN"],
        ['wegner cpas llp madison WI']],
        columns=['org_name'])

org_list = df['org_name']

threshold = 40

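def find_match(x):
    # Ask for the top 2 results and take the second one ([1]):
    # the best hit is the query itself, since org_list contains it.
    match = process.extract(x, org_list, limit=2, scorer=fuzz.partial_token_sort_ratio)[1]
    # match is a (matched string, score, index) tuple; keep it only above the threshold
    match = match if match[1] > threshold else np.nan
    return match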

df['match found'] = df['org_name'].apply(find_match)

Returns (each match is a (matched string, score, index) tuple):

                                   org_name                                     match found
0     cliftonlarsonallen llp minneapolis MN             (wegner cpas llp madison WI, 50, 3)
1            loeb and troper llp newyork NY             (wegner cpas llp madison WI, 46, 3)
2  dauby o'connor and zaleski llc carmel IN                                             NaN
3                wegner cpas llp madison WI  (cliftonlarsonallen llp minneapolis MN, 50, 0)

To return only the matching string itself, change the second assignment in find_match() to:

match = match[0] if match[1]>threshold else np.nan

As @user3483203 suggested in a comment, a list comprehension is an alternative to apply():

df['match found'] = [find_match(row) for row in df['org_name']]

Note that process.extract() handles a single query string: it applies the scoring algorithm you pass to that query and every supplied match option. With your code as currently set up, each query is therefore evaluated against all 70,000 match options, for a total of len(match_options)**2 (or 4,900,000,000) string comparisons. I think the best performance improvement would come from limiting the potential match options via more extensive logic in the find_match() function, e.g. only considering match options that start with the same letter as the query.
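A minimal sketch of the "same first letter" idea, assuming it is acceptable to miss matches whose first character differs (for example because of a typo). The names build_buckets, find_match_blocked and simple_ratio are hypothetical helpers, not part of fuzzywuzzy, and simple_ratio is a standard-library stand-in scorer used only so the sketch runs without extra packages. The second name in the sample list is an invented variant added for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in scorer (NOT a fuzzywuzzy function): 0-100 similarity via difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def build_buckets(names):
    """Group names by lower-cased first character, so a query is only
    compared against names in its own bucket."""
    buckets = defaultdict(list)
    for name in names:
        if name:  # skip empty strings
            buckets[name[0].lower()].append(name)
    return buckets


def find_match_blocked(query, buckets, threshold=40, scorer=simple_ratio):
    """Return the best-scoring other name in the query's bucket,
    or None when nothing scores above the threshold."""
    if not query:
        return None
    best_name, best_score = None, threshold
    for candidate in buckets.get(query[0].lower(), []):
        if candidate == query:
            continue  # skip the query itself (this also skips exact duplicates)
        score = scorer(query, candidate)
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name


names = [
    'cliftonlarsonallen llp minneapolis MN',
    'clifton larson allen llp minneapolis MN',  # hypothetical variant for illustration
    'loeb and troper llp newyork NY',
    'wegner cpas llp madison WI',
]
buckets = build_buckets(names)
matches = [find_match_blocked(name, buckets) for name in names]
```

Each query is now scored only against its own bucket rather than the full list, so the number of comparisons drops from roughly n**2 to the sum of the squared bucket sizes. To keep the original scorer, pass scorer=fuzz.partial_token_sort_ratio; on a DataFrame the same function can be applied with df['org_name'].apply(find_match_blocked, args=(buckets,)).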
