Is it possible to do a fuzzy match merge with Python pandas?

[愿得一人] 2020-11-22 01:17

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I woul

11 Answers
  •  逝去的感伤
    2020-11-22 01:59

    For a general approach: fuzzy_merge

    For a more general scenario in which we want to merge columns from two dataframes which contain slightly different strings, the following function uses difflib.get_close_matches along with merge in order to mimic the functionality of pandas' merge but with fuzzy matching:

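    import difflib

    def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
        # Work on a copy so the caller's df2 is left untouched.
        df_other = df2.copy()
        # Map each right-hand key to its closest left-hand key (None if no match).
        df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff)
                             for x in df_other[right_on]]
        return df1.merge(df_other, on=left_on, how=how)

    def get_closest_match(x, other, cutoff):
        # Matches come back sorted best-first; the list is empty if none pass the cutoff.
        matches = difflib.get_close_matches(x, other, cutoff=cutoff)
        return matches[0] if matches else None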
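
    To get a feel for the cutoff parameter, here is a minimal sketch using only difflib. The list name keys and the stricter cutoff of 0.7 are illustrative assumptions, not part of the answer above. difflib.get_close_matches keeps only candidates whose SequenceMatcher ratio is at least the cutoff:

```python
import difflib

# Hypothetical list of left-hand keys, mirroring the sample data in this answer.
keys = ['one', 'two', 'three', 'four', 'five']

# 'two' and 'too' share two characters in order, so the ratio is 2*2/6, about 0.67.
ratio = difflib.SequenceMatcher(None, 'two', 'too').ratio()

# With the default-like cutoff of 0.6, 'two' is close enough to 'too'.
loose = difflib.get_close_matches('too', keys, cutoff=0.6)

# With a stricter cutoff of 0.7, no key is close enough and the list is empty.
strict = difflib.get_close_matches('too', keys, cutoff=0.7)

print(ratio, loose, strict)
```

    A lower cutoff accepts more distant spellings at the risk of false matches, so it is worth checking a few known pairs before merging real data.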
    

    Here are some use cases with two sample dataframes:

    print(df1)
    
         key   number
    0    one       1
    1    two       2
    2  three       3
    3   four       4
    4   five       5
    
    print(df2)
    
                     key_close  letter
    0                    three      c
    1                      one      a
    2                      too      b
    3                    fours      d
    4  a very different string      e
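
    The answer only shows the printed frames, so the construction below is an assumption about how they could be rebuilt to reproduce the examples:

```python
import pandas as pd

# Left frame: the "clean" keys.
df1 = pd.DataFrame({
    'key': ['one', 'two', 'three', 'four', 'five'],
    'number': [1, 2, 3, 4, 5],
})

# Right frame: keys with slightly different spellings, plus one unrelated string.
df2 = pd.DataFrame({
    'key_close': ['three', 'one', 'too', 'fours', 'a very different string'],
    'letter': ['c', 'a', 'b', 'd', 'e'],
})

print(df1)
print(df2)
```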
    

    With the above example, we'd get:

    fuzzy_merge(df1, df2, left_on='key', right_on='key_close')
    
         key  number key_close letter
    0    one       1       one      a
    1    two       2       too      b
    2  three       3     three      c
    3   four       4     fours      d
    

    And we could do a left join with:

    fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')
    
         key  number key_close letter
    0    one       1       one      a
    1    two       2       too      b
    2  three       3     three      c
    3   four       4     fours      d
    4   five       5       NaN    NaN
    

    For a right join, all non-matching keys in the left dataframe would be set to None:

    fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')
    
         key  number                key_close letter
    0    one     1.0                      one      a
    1    two     2.0                      too      b
    2  three     3.0                    three      c
    3   four     4.0                    fours      d
    4   None     NaN  a very different string      e
    

    Also note that difflib.get_close_matches returns an empty list if no item matches within the cutoff. In the shared example, suppose we change the last index label in df2 to, say:

    print(df2)
    
                              letter
    one                          a
    too                          b
    three                        c
    fours                        d
    a very different string      e
    

    We'd get an index out of range error:

    df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
    

    IndexError: list index out of range

    To avoid this, the get_closest_match function above indexes the list returned by difflib.get_close_matches only if it actually contains a match, and returns None otherwise.
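
    The failure and the guard can be reproduced without pandas. This is a small sketch; the names left_keys, matches, error and closest are illustrative assumptions:

```python
import difflib

# Hypothetical left-hand keys, mirroring the sample data in this answer.
left_keys = ['one', 'two', 'three', 'four', 'five']

# No key clears the default cutoff of 0.6, so an empty list comes back.
matches = difflib.get_close_matches('a very different string', left_keys)

# Indexing the empty list unconditionally raises IndexError.
error = None
try:
    matches[0]
except IndexError as exc:
    error = exc

# Guarded version: fall back to None when there is no match.
closest = matches[0] if matches else None

print(matches, repr(error), closest)
```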
