Searching one Python dataframe / dictionary for fuzzy matches in another dataframe

北恋 2021-01-03 03:34

I have the following pandas dataframe with 50,000 unique rows and 20 columns (included is a snippet of the relevant columns):

df1:

          


        
2 Answers
  • 2021-01-03 04:26

    You should be able to iterate over both dataframes and populate either a dict or a third dataframe with the information you want:

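    import pandas as pd

    # One list per output column; each list gets one entry per (df1, df2) pair
    d = {
        'df1_id': [],
        'df1_prod_desc': [],
        'df2_id': [],
        'df2_prod_desc': [],
        'fuzzywuzzy_sim': []
    }
    for _, df1_row in df1.iterrows():
        for _, df2_row in df2.iterrows():
            # append rather than assign, so earlier pairs are not overwritten
            d['df1_id'].append(df1_row['PRODUCT_ID'])
            ...
    df3 = pd.DataFrame.from_dict(d)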
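
For reference, here is a minimal end-to-end sketch of the same idea. It rests on a few assumptions that are not in the answer: the sample rows are made up, the `similarity` helper is a hypothetical stand-in built on the standard library's `difflib` rather than the real `fuzz.ratio`, and the df2 column names (`PROD_ID`, `PROD_DESCRIPTION`) are borrowed from the other answer.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a: str, b: str) -> int:
    """Hypothetical stand-in for fuzz.ratio: a 0-100 score from difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Made-up sample data; the real frames come from the question.
df1 = pd.DataFrame({
    "PRODUCT_ID": [1, 2],
    "PRODUCT_DESCRIPTION": ["red apple", "green pear"],
})
df2 = pd.DataFrame({
    "PROD_ID": [10, 20, 30],
    "PROD_DESCRIPTION": ["red apple", "green pears", "blue car"],
})

# Collect one record per (df1 row, df2 row) pair.
rows = []
for _, r1 in df1.iterrows():
    for _, r2 in df2.iterrows():
        rows.append({
            "df1_id": r1["PRODUCT_ID"],
            "df1_prod_desc": r1["PRODUCT_DESCRIPTION"],
            "df2_id": r2["PROD_ID"],
            "df2_prod_desc": r2["PROD_DESCRIPTION"],
            "fuzzywuzzy_sim": similarity(
                r1["PRODUCT_DESCRIPTION"], r2["PROD_DESCRIPTION"]
            ),
        })

df3 = pd.DataFrame(rows)
print(df3)
```

The result has one row per pair, so its length is the product of the two row counts. With 50,000 rows on one side, that grows quickly, which is worth keeping in mind before running the nested loop on the full data.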
    
  • 2021-01-03 04:28

    Using fuzz.ratio as the similarity metric (it returns a score from 0 to 100, where higher means more similar), compute the score matrix like this:

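    import numpy as np
    import pandas as pd
    from fuzzywuzzy import fuzz

    # One cell per (df row, df2 row) pair
    df3 = pd.DataFrame(index=df.index, columns=df2.index)

    for i in df3.index:
        for j in df3.columns:
            # .at replaces get_value/set_value, which were removed in pandas 1.0
            vi = df.at[i, 'PRODUCT_DESCRIPTION']
            vj = df2.at[j, 'PROD_DESCRIPTION']
            df3.at[i, j] = fuzz.ratio(vi, vj)

    # The empty frame starts as object dtype; cast so max/idxmax work
    df3 = df3.astype(int)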
    
    print(df3)
    
        0   1   2   3   4   5
    0  63  15  24  23  34  27
    1  26  84  19  21  52  32
    2  18  31  33  12  35  34
    3  10  31  35  10  41  42
    4  29  52  32  10  42  12
    5  15  28  21  49   8  55
    

    Set a threshold for an acceptable score (I used 50).
    Then, for every row, find the df2 index label with the highest score.

    threshold = df3.max(1) > 50
    idxmax = df3.idxmax(1)
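
To see what these two lines produce, here is a small sketch that rebuilds the score matrix from the printed output above and lists, per row, the best score, the df2 index it came from, and whether it clears the threshold. The `best_score` and `summary` names are only for illustration.

```python
import pandas as pd

# Score matrix copied from the printed output above
# (rows: df index, columns: df2 index).
df3 = pd.DataFrame([
    [63, 15, 24, 23, 34, 27],
    [26, 84, 19, 21, 52, 32],
    [18, 31, 33, 12, 35, 34],
    [10, 31, 35, 10, 41, 42],
    [29, 52, 32, 10, 42, 12],
    [15, 28, 21, 49, 8, 55],
])

best_score = df3.max(axis=1)    # highest score in each row
threshold = best_score > 50     # strict: a score of exactly 50 is rejected
idxmax = df3.idxmax(axis=1)     # df2 index label of that highest score

summary = pd.DataFrame({
    "best_score": best_score,
    "df2_index": idxmax,
    "accepted": threshold,
})
print(summary)
```

Rows 2 and 3 peak at 35 and 42, so they stay unmatched. Rows 1 and 4 both pick df2 row 1, so this approach does not stop several rows from mapping to the same df2 entry.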
    

    Assign the matched df2 values, leaving NaN where no score clears the threshold:

    df['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
    df['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
    df
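
A self-contained sketch of the assignment step, under stated assumptions: the two frames and the `scores` matrix are made-up values chosen so that one row falls below the threshold, and the result goes into a copy named `matched` (a hypothetical name) instead of modifying `df` in place.

```python
import numpy as np
import pandas as pd

# Made-up frames and scores, for illustration only.
df = pd.DataFrame({
    "PRODUCT_DESCRIPTION": ["red apple", "green pear", "steel bolt"],
})
df2 = pd.DataFrame({
    "PROD_ID": [10, 20],
    "PROD_DESCRIPTION": ["red apples", "pear, green"],
})
scores = pd.DataFrame(
    [[95, 20], [30, 60], [10, 15]],
    index=df.index,
    columns=df2.index,
)

threshold = scores.max(axis=1) > 50   # True where the best score is acceptable
idxmax = scores.idxmax(axis=1)        # best df2 index label for each df row

# np.where keeps the looked-up value where threshold is True, NaN elsewhere.
matched = df.copy()
matched["PROD_ID"] = np.where(
    threshold, df2.loc[idxmax, "PROD_ID"].values, np.nan
)
matched["PROD_DESCRIPTION"] = np.where(
    threshold, df2.loc[idxmax, "PROD_DESCRIPTION"].values, np.nan
)
print(matched)
```

Because NaN is a float, the PROD_ID column comes back as floats once any row is unmatched.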
    
