Is it possible to do a fuzzy match merge with Python pandas?

[愿得一人] 2020-11-22 01:17

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I woul

11 Answers
  • 2020-11-22 01:59

    For a general approach: fuzzy_merge

    For a more general scenario in which we want to merge columns from two dataframes which contain slightly different strings, the following function uses difflib.get_close_matches along with merge in order to mimic the functionality of pandas' merge but with fuzzy matching:

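    import difflib

    def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
        # Work on a copy so the caller's df2 is left untouched
        df_other = df2.copy()
        # Map each right-hand key to its closest left-hand key (None if no match)
        df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff)
                             for x in df_other[right_on]]
        return df1.merge(df_other, on=left_on, how=how)

    def get_closest_match(x, other, cutoff):
        # get_close_matches returns a list sorted best-first; it may be empty
        matches = difflib.get_close_matches(x, other, cutoff=cutoff)
        return matches[0] if matches else None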
    

    Here are some use cases with two sample dataframes:

    print(df1)
    
         key   number
    0    one       1
    1    two       2
    2  three       3
    3   four       4
    4   five       5
    
    print(df2)
    
                     key_close  letter
    0                    three      c
    1                      one      a
    2                      too      b
    3                    fours      d
    4  a very different string      e
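
    The sample dataframes are only shown printed. A minimal sketch of how they could be built, with the values read off the printed output above (the construction itself is an assumption, not part of the original answer):

```python
import pandas as pd

# Left table: exact keys with an associated number
df1 = pd.DataFrame({
    "key": ["one", "two", "three", "four", "five"],
    "number": [1, 2, 3, 4, 5],
})

# Right table: slightly misspelled keys plus one unrelated string
df2 = pd.DataFrame({
    "key_close": ["three", "one", "too", "fours", "a very different string"],
    "letter": ["c", "a", "b", "d", "e"],
})

print(df1)
print(df2)
```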
    

    With the above example, we'd get:

    fuzzy_merge(df1, df2, left_on='key', right_on='key_close')
    
         key  number key_close letter
    0    one       1       one      a
    1    two       2       too      b
    2  three       3     three      c
    3   four       4     fours      d
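
    To see why 'too' is paired with 'two' at the default cutoff of 0.6, it can help to look at the similarity ratio that difflib computes. The sketch below uses only the standard library; the helper name similarity is made up for illustration:

```python
import difflib

keys = ["one", "two", "three", "four", "five"]

def similarity(a, b):
    # Ratio in [0, 1]: 2 * matched characters / total characters
    return difflib.SequenceMatcher(None, a, b).ratio()

# 'too' vs 'two' share 2 of 6 characters in total, so the ratio is about 0.667
print(round(similarity("too", "two"), 3))

# At the default cutoff the pair is accepted ...
print(difflib.get_close_matches("too", keys, cutoff=0.6))

# ... while a stricter cutoff rejects it and yields an empty list
print(difflib.get_close_matches("too", keys, cutoff=0.8))
```

    Raising cutoff trades recall for precision, so it is worth checking a few borderline pairs from your own data before settling on a value.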
    

    And we could do a left join with:

    fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')
    
         key  number key_close letter
    0    one       1       one      a
    1    two       2       too      b
    2  three       3     three      c
    3   four       4     fours      d
    4   five       5       NaN    NaN
    

    For a right join, all non-matching keys in the left dataframe are set to None:

    fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')
    
         key  number                key_close letter
    0    one     1.0                      one      a
    1    two     2.0                      too      b
    2  three     3.0                    three      c
    3   four     4.0                    fours      d
    4   None     NaN  a very different string      e
    

    Also note that difflib.get_close_matches will return an empty list if no item is matched within the cutoff. In the shared example, if we change the last index in df2 to say:

    print(df2)
    
                              letter
    one                          a
    too                          b
    three                        c
    fours                        d
    a very different string      e
    

    We'd get an index out of range error:

    df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
    

    IndexError: list index out of range

    To avoid this, the get_closest_match function above indexes the list returned by difflib.get_close_matches only if it actually contains a match, and returns None otherwise.
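
    The difference between the unguarded and guarded lookups can be reproduced without pandas. This is a small standard-library sketch; the helper name closest_or_none is hypothetical:

```python
import difflib

keys = ["one", "two", "three", "four", "five"]

def closest_or_none(word, candidates, cutoff=0.6):
    # Ask for at most one match; the result is an empty list if nothing
    # reaches the cutoff, so only index it when it is non-empty
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_or_none("fours", keys))
print(closest_or_none("a very different string", keys))

# The unguarded version fails on the unmatched string
try:
    difflib.get_close_matches("a very different string", keys)[0]
except IndexError as exc:
    print("IndexError:", exc)
```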

  • 2020-11-22 02:00

    You can use d6tjoin for that

    import d6tjoin.top1
    d6tjoin.top1.MergeTop1(df1.reset_index(),df2.reset_index(),
           fuzzy_left_on=['index'],fuzzy_right_on=['index']).merge()['merged']
    

       index  number index_right letter
    0    one       1         one      a
    1    two       2         too      b
    2  three       3       three      c
    3   four       4       fours      d
    4   five       5        five      e

    It has a variety of additional features such as:

    • check join quality, pre and post join
    • customize the similarity function, e.g. edit distance vs Hamming distance
    • specify max distance
    • multi-core compute

    For details see

    • MergeTop1 examples - Best match join examples notebook
    • PreJoin examples - Examples for diagnosing join problems
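
    To make the edit distance vs Hamming distance choice concrete, here is a plain-Python sketch of both measures. It is for illustration only and is not how d6tjoin implements them; the function names are made up:

```python
def hamming(a, b):
    # Counts positions that differ; only defined for equal-length strings
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    # Classic dynamic programme: minimum number of single-character
    # insertions, deletions and substitutions turning a into b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(hamming("two", "too"))          # same length, one differing position
print(levenshtein("four", "fours"))   # one insertion
```

    Hamming distance cannot compare 'four' with 'fours' at all, which is why edit distance is usually the better fit for keys with missing or extra characters.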
  • 2020-11-22 02:02

    Using fuzzywuzzy

    2019 answer

    Since there are no examples with the fuzzywuzzy package, here's a function I wrote which will return all matches based on a threshold you can set as a user:


    Example dataframes

    import pandas as pd

    df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
    df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
    
    # df1
              Key
    0       Apple
    1      Banana
    2      Orange
    3  Strawberry
    
    # df2
            Key
    0      Aple
    1     Mango
    2      Orag
    3     Straw
    4  Bannanna
    5     Berry
    

    Function for fuzzy matching

    from fuzzywuzzy import process

    def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
        """
        Note: the 'matches' column is added to df_1 in place.

        :param df_1: the left table to join
        :param df_2: the right table to join
        :param key1: key column of the left table
        :param key2: key column of the right table
        :param threshold: minimum score (0-100, based on Levenshtein distance) for a match to be returned
        :param limit: the number of matches that will get returned, these are sorted high to low
        :return: dataframe with both keys and matches
        """
        s = df_2[key2].tolist()

        # For each left key, get the top `limit` (candidate, score) pairs
        m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
        df_1['matches'] = m

        # Keep only candidates at or above the threshold, joined into one string
        m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
        df_1['matches'] = m2

        return df_1
    

    Using our function on the dataframes: #1

    from fuzzywuzzy import fuzz
    from fuzzywuzzy import process
    
    fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)
    
              Key       matches
    0       Apple          Aple
    1      Banana      Bannanna
    2      Orange          Orag
    3  Strawberry  Straw, Berry
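
    How threshold and limit interact may be easier to see in isolation. The sketch below mimics the two steps of the function (take the top limit candidates, then drop those below threshold) using only the standard library. The extract helper is a hypothetical stand-in for process.extract, and it scores with difflib rather than fuzzywuzzy's default scorer, so the numbers will not match fuzzywuzzy's exactly:

```python
import difflib

def extract(query, choices, limit=2):
    # Hypothetical stand-in for process.extract: returns the top `limit`
    # (choice, score) pairs, with scores on a 0-100 scale
    scored = [(c, round(100 * difflib.SequenceMatcher(None, query, c).ratio()))
              for c in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

def join_matches(query, choices, threshold=90, limit=2):
    # Keep only candidates at or above the threshold, as the function above does
    return ', '.join(name for name, score in extract(query, choices, limit)
                     if score >= threshold)

choices = ['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']

print(extract('Apple', choices))
print(join_matches('Apple', choices, threshold=80))
print(repr(join_matches('Apple', choices, threshold=95)))
```

    A key with no candidate above the threshold ends up with an empty string, which is what happens to IBM in the second example.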
    

    Using our function on the dataframes: #2

    df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
    df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})
    
    fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)
    
            Col1  matches
    0  Microsoft  Mcrsoft
    1     Google    gogle
    2     Amazon   Amason
    3        IBM         
    

    Installation:

    Pip

    pip install fuzzywuzzy
    

    Anaconda

    conda install -c conda-forge fuzzywuzzy
    
  • 2020-11-22 02:03

    As a heads up, applying get_close_matches directly basically works, except when no match is found or when either column contains NaNs. I found it easier to apply the following function instead. The choice of NaN replacements will depend a lot on your dataset.

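    import difflib
    import numpy as np
    import pandas as pd

    def fuzzy_match(a, b):
        left = '1' if pd.isnull(a) else a   # sentinel for a missing left value
        right = b.fillna('2')               # different sentinel on the right
        out = difflib.get_close_matches(left, right)
        return out[0] if out else np.nan    # np.NaN was removed in NumPy 2.0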
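
    If picking sentinel strings feels fragile for your data, one alternative is to skip missing values entirely. This is a sketch under that assumption, written without pandas so it runs on plain lists; the helper names are made up:

```python
import difflib
import math

def is_missing(value):
    # Treat None and float NaN as missing
    return value is None or (isinstance(value, float) and math.isnan(value))

def fuzzy_match_skip_missing(a, candidates, cutoff=0.6):
    # A missing left value can never match anything
    if is_missing(a):
        return None
    # Drop missing candidates so difflib only ever sees strings
    clean = [c for c in candidates if not is_missing(c)]
    matches = difflib.get_close_matches(a, clean, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_match_skip_missing("too", ["one", float("nan"), "two"]))
print(fuzzy_match_skip_missing(float("nan"), ["one", "two"]))
```

    Unlike the sentinel approach, this returns None rather than NaN for unmatched values, so convert afterwards if the rest of your pipeline expects NaN.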
    
  • 2020-11-22 02:03

    There is a package called fuzzy_pandas that can use the levenshtein, jaro, metaphone and bilenko methods. The project page has some great examples.

    import pandas as pd
    import fuzzy_pandas as fpd
    
    df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
    df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
    
    results = fpd.fuzzy_merge(df1, df2,
                left_on='Key',
                right_on='Key',
                method='levenshtein',
                threshold=0.6)
    
    results.head()
    
    
          Key       Key
    0   Apple      Aple
    1  Banana  Bannanna
    2  Orange      Orag
    