is it possible to do fuzzy match merge with python pandas?

2020-11-22 01:17

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different numbers of spaces, and the absence/presence of diacritical marks, I would like to be able to merge as long as the values are similar to one another.

11 Answers
  • 2020-11-22 01:37

    I have written a Python package which aims to solve this problem:

    pip install fuzzymatcher

    You can find the repo here and docs here.

    Basic usage:

    Given two dataframes df_left and df_right, which you want to fuzzy join, you can write the following:

    import fuzzymatcher
    
    # Columns to match on from df_left
    left_on = ["fname", "mname", "lname",  "dob"]
    
    # Columns to match on from df_right
    right_on = ["name", "middlename", "surname", "date"]
    
    # The link table potentially contains several matches for each record
    fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
    

    Or if you just want to link on the closest match:

    fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)
    
  • 2020-11-22 01:42

    I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].

    This is how I would do it with Jaro-Winkler from the jellyfish package:

    import jellyfish
    import pandas

    def get_closest_match(x, list_strings):
        # Return the string in list_strings with the highest Jaro-Winkler
        # similarity to x (None if list_strings is empty)
        best_match = None
        highest_jw = 0

        for current_string in list_strings:
            # Note: newer jellyfish releases expose this as jaro_winkler_similarity
            current_score = jellyfish.jaro_winkler(x, current_string)

            if current_score > highest_jw:
                highest_jw = current_score
                best_match = current_string

        return best_match

    df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
    df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

    # Remap df2's index to the closest-matching label in df1's index, then join
    df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

    df1.join(df2)
    

    Output:

        number  letter
    one     1   a
    two     2   b
    three   3   c
    four    4   d
    five    5   e
    
  • 2020-11-22 01:45

    The pandas merging documentation (http://pandas.pydata.org/pandas-docs/dev/merging.html) does not offer a hook function to do this on the fly. Would be nice though...

    I would just do it as a separate step: use difflib's get_close_matches to create a new column in one of the two DataFrames, then merge/join on the fuzzy-matched column (a minimal sketch follows).
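
    A minimal sketch of that approach, assuming both DataFrames have a 'name' column to match on and that every value has at least one close match (the column names here are made up for illustration):

    import difflib
    import pandas as pd

    df1 = pd.DataFrame({'name': ['one', 'two', 'three'], 'number': [1, 2, 3]})
    df2 = pd.DataFrame({'name': ['one', 'too', 'threee'], 'letter': ['a', 'b', 'c']})

    # Build a merge key in df2 by replacing each name with its closest match in df1
    df2['name_matched'] = df2['name'].map(
        lambda x: difflib.get_close_matches(x, df1['name'], n=1, cutoff=0.6)[0]
    )

    merged = df1.merge(df2, left_on='name', right_on='name_matched')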

  • 2020-11-22 01:46

    I used the fuzzymatcher package and it worked well for me. Visit this link for more details on it.

    Use the command below to install it:

    pip install fuzzymatcher

    Below is the sample code (already posted by RobinL above):

    import fuzzymatcher
    
    # Columns to match on from df_left
    left_on = ["fname", "mname", "lname",  "dob"]
    
    # Columns to match on from df_right
    right_on = ["name", "middlename", "surname", "date"]
    
    # The link table potentially contains several matches for each record
    fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
    

    Errors you may get:

    1. ZeroDivisionError: float division by zero ---> refer to this link to resolve it
    2. OperationalError: No Such Module: fts4 ---> download sqlite3.dll from here and replace the DLL file in your Python or Anaconda DLLs folder.

    Pros:

    1. Works fast. In my case, I compared a DataFrame with 3,000 rows against another DataFrame with 170,000 records. It also uses SQLite3 full-text search, so it is faster than many alternatives.
    2. Can check across multiple columns and two DataFrames. In my case, I was looking for the closest match based on address and company name. Sometimes the company name is the same, but the address is a good thing to check too.
    3. Gives you a score for all the closest matches for the same record; you choose the cutoff score (see the sketch after this list).

    Cons:

    1. The original package installation is buggy.
    2. Requires C++ and Visual Studio to be installed as well.
    3. Won't work on 64-bit Anaconda/Python.
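
    A minimal sketch of filtering on such a score, assuming the joined result exposes a best_match_score column as in recent fuzzymatcher versions (the column name and cutoff value are assumptions, check your version's docs):

    import fuzzymatcher

    # Fuzzy left join, then keep only rows whose match score clears a chosen cutoff
    result = fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)
    good_matches = result[result["best_match_score"] > 0.1]  # cutoff is data-dependent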
  • 2020-11-22 01:47

    Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then join:

    In [23]: import difflib 
    
    In [24]: difflib.get_close_matches
    Out[24]: <function difflib.get_close_matches>
    
    In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
    
    In [26]: df2
    Out[26]: 
          letter
    one        a
    two        b
    three      c
    four       d
    five       e
    
    In [31]: df1.join(df2)
    Out[31]: 
           number letter
    one         1      a
    two         2      b
    three       3      c
    four        4      d
    five        5      e
    


    If these were columns, in the same vein you could apply it to the column and then merge:

    import pandas as pd

    df1 = pd.DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
    df2 = pd.DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

    # Replace each name in df2 with its closest match among df1's names, then merge on it
    df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
    df1.merge(df2)
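
    Note that difflib.get_close_matches returns an empty list when nothing clears its cutoff, so the [0] above raises an IndexError for names with no close match. A small guard keeps those rows unmatched instead (a sketch; the helper name is made up):

    def closest_or_none(x, candidates):
        # Return the single closest match, or None when nothing clears the cutoff
        matches = difflib.get_close_matches(x, candidates, n=1, cutoff=0.6)
        return matches[0] if matches else None

    df2['name'] = df2['name'].apply(lambda x: closest_or_none(x, df1['name']))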
    
  • 2020-11-22 01:50

    For more complex use cases, such as matching rows on many columns, you can use the recordlinkage package. recordlinkage provides all the tools to fuzzy match rows between pandas DataFrames, which also helps to deduplicate your data when merging (a minimal sketch follows). I have written a detailed article about the package here.
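
    A minimal sketch of what that can look like, assuming two DataFrames df_a and df_b that both have 'name' and 'address' columns (the column names, methods, and thresholds below are illustrative assumptions, not from the original answer):

    import recordlinkage

    # Generate candidate record pairs (a full index compares every row of df_a with every row of df_b)
    indexer = recordlinkage.Index()
    indexer.full()
    candidate_pairs = indexer.index(df_a, df_b)

    # Score each candidate pair on fuzzy string similarity, per column
    compare = recordlinkage.Compare()
    compare.string('name', 'name', method='jarowinkler', threshold=0.85, label='name')
    compare.string('address', 'address', method='levenshtein', threshold=0.85, label='address')
    features = compare.compute(candidate_pairs, df_a, df_b)

    # Keep pairs that matched on both columns
    matches = features[features.sum(axis=1) == 2]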
