Searching one Python dataframe / dictionary for fuzzy matches in another dataframe

ぐ巨炮叔叔 提交于 2019-12-04 17:01:46

using fuzz.ratio as my distance metric, calculate my distance matrix like this

df3 = pd.DataFrame(index=df.index, columns=df2.index)

for i in df3.index:
    for j in df3.columns:
        vi = df.get_value(i, 'PRODUCT_DESCRIPTION')
        vj = df2.get_value(j, 'PROD_DESCRIPTION')
        df3.set_value(
            i, j, fuzz.ratio(vi, vj))

print(df3)

    0   1   2   3   4   5
0  63  15  24  23  34  27
1  26  84  19  21  52  32
2  18  31  33  12  35  34
3  10  31  35  10  41  42
4  29  52  32  10  42  12
5  15  28  21  49   8  55

Set a threshold for acceptable distance. I set 50
Find the index value (for df2) that has maximum value for every row.

threshold = df3.max(1) > 50
idxmax = df3.idxmax(1)

Make assignments

df['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
df

You should be able to iterate over both dataframes and populate either a dict of a 3rd dataframe with your desired information:

d = {
    'df1_id': [],
    'df1_prod_desc': [],
    'df2_id': [],
    'df2_prod_desc': [],
    'fuzzywuzzy_sim': []
}
for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        d['df1_id'] = df1_row['PRODUCT_ID']
        ...
df3 = pd.DataFrame.from_dict(d)
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!