Apply fuzzy matching across a dataframe column and save results in a new column

时光说笑 2020-11-28 10:38

I have two data frames, each with a different number of rows. Below are a couple of rows from each data set.

df1 =
     Company                                      


        
1 Answer
  • 2020-11-28 11:23

    I couldn't tell what you were doing. This is how I would do it.

    import pandas as pd
    from fuzzywuzzy import fuzz
    from fuzzywuzzy import process
    

    Create a series of tuples to compare:

    compare = pd.MultiIndex.from_product([df1['Company'],
                                          df2['FDA Company']]).to_series()
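
To make the shape of `compare` concrete, here is a minimal sketch. The two small dataframes are made-up sample data (the question's real rows are not shown), so treat the names in them as assumptions.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's dataframes.
df1 = pd.DataFrame({"Company": ["ABC Pharma Inc", "Zeta Labs"]})
df2 = pd.DataFrame({"FDA Company": ["Pharma ABC Inc",
                                    "Zeta Laboratories",
                                    "Omega Foods"]})

# Cartesian product of both columns: every df1 name paired with every df2 name.
compare = pd.MultiIndex.from_product([df1["Company"],
                                      df2["FDA Company"]]).to_series()

print(len(compare))     # number of pairs: len(df1) * len(df2)
print(compare.iloc[0])  # each value is a (Company, FDA Company) tuple
```

Note that the number of pairs grows as the product of the two lengths, so this approach can get slow and memory-hungry on large frames.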
    

    Create a special function to calculate fuzzy metrics and return a series.

    def metrics(tup):
        return pd.Series([fuzz.ratio(*tup),
                          fuzz.token_sort_ratio(*tup)],
                         ['ratio', 'token'])
    

    Apply metrics to the compare series. This returns a dataframe with one row per pair of names and one column per metric:

    compare.apply(metrics)
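
As an illustration of what the scored table looks like, here is a small end-to-end sketch. The sample data is hypothetical. The `difflib` fallback is my own stand-in so the snippet still runs without fuzzywuzzy installed; it only approximates the library's scores.

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio, token_sort_ratio = fuzz.ratio, fuzz.token_sort_ratio
except ImportError:
    # Assumption: rough stand-ins built on difflib, not fuzzywuzzy itself.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

    def token_sort_ratio(a, b):
        def norm(s):
            return " ".join(sorted(s.lower().split()))
        return ratio(norm(a), norm(b))

# Hypothetical sample data.
df1 = pd.DataFrame({"Company": ["ABC Pharma Inc", "Zeta Labs"]})
df2 = pd.DataFrame({"FDA Company": ["Pharma ABC Inc",
                                    "Zeta Laboratories",
                                    "Omega Foods"]})

compare = pd.MultiIndex.from_product([df1["Company"],
                                      df2["FDA Company"]]).to_series()

def metrics(tup):
    # One row of scores per (Company, FDA Company) pair.
    return pd.Series([ratio(*tup), token_sort_ratio(*tup)],
                     ["ratio", "token"])

scores = compare.apply(metrics)
print(scores)

# These two names are identical once their tokens are sorted,
# so the token score for this pair is 100.
print(scores.loc[("ABC Pharma Inc", "Pharma ABC Inc"), "token"])
```

Storing the result in a variable such as `scores` also avoids recomputing every pair each time the table is reshaped.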
    

    There are a bunch of ways to do this next part:

    For each row of df2, get the closest match from df1 (the result is indexed by FDA Company)

    compare.apply(metrics).unstack().idxmax().unstack(0)
    

    For each row of df1, get the closest match from df2 (the result is indexed by Company)

    compare.apply(metrics).unstack(0).idxmax().unstack(0)
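
If the goal is to save the best match in a new column of df1, as the question title asks, one alternative is a per-row lookup with `process.extractOne` instead of the full pairwise table. This is a sketch under assumptions: the sample data, the `best_match` helper, and the `Best Match` and `Score` column names are all hypothetical, and the `difflib` fallback is only a stand-in for when fuzzywuzzy is not installed.

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz, process

    def best_match(name, choices):
        # With a list of choices, extractOne returns a (choice, score) tuple.
        return process.extractOne(name, choices,
                                  scorer=fuzz.token_sort_ratio)
except ImportError:
    # Assumption: approximate stand-in built on difflib.
    from difflib import SequenceMatcher

    def _token_sort_ratio(a, b):
        def norm(s):
            return " ".join(sorted(s.lower().split()))
        matcher = SequenceMatcher(None, norm(a), norm(b))
        return int(round(100 * matcher.ratio()))

    def best_match(name, choices):
        scored = ((c, _token_sort_ratio(name, c)) for c in choices)
        return max(scored, key=lambda pair: pair[1])

# Hypothetical sample data.
df1 = pd.DataFrame({"Company": ["ABC Pharma Inc", "Zeta Labs"]})
df2 = pd.DataFrame({"FDA Company": ["Pharma ABC Inc",
                                    "Zeta Laboratories",
                                    "Omega Foods"]})

choices = df2["FDA Company"].tolist()
matches = [best_match(name, choices) for name in df1["Company"]]

# Save the matched name and its score as new columns on df1.
df1["Best Match"] = [m[0] for m in matches]
df1["Score"] = [m[1] for m in matches]
print(df1)
```

A score threshold can then be applied to `Score` to discard weak matches; what counts as weak depends on the data.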
    
