is it possible to do fuzzy match merge with python pandas?

[愿得一人] 2020-11-22 01:17

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different numbers of spaces, and the absence/presence of diacritical marks, I would like to be able to merge as long as they are similar to one another.

11 answers
  • 2020-11-22 01:37

    I have written a Python package which aims to solve this problem:

    pip install fuzzymatcher

    You can find the repo here and docs here.

    Basic usage:

    Given two dataframes df_left and df_right, which you want to fuzzy join, you can write the following:

    import fuzzymatcher
    
    # Columns to match on from df_left
    left_on = ["fname", "mname", "lname",  "dob"]
    
    # Columns to match on from df_right
    right_on = ["name", "middlename", "surname", "date"]
    
    # The link table potentially contains several matches for each record
    fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
    

    Or if you just want to link on the closest match:

    fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)
    
  • 2020-11-22 01:42

    I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].

    This is how I would do it with Jaro-Winkler from the jellyfish package:

    import jellyfish
    import pandas

    def get_closest_match(x, list_strings):
        # Return the string in list_strings with the highest Jaro-Winkler similarity to x
        best_match = None
        highest_jw = 0

        for current_string in list_strings:
            current_score = jellyfish.jaro_winkler(x, current_string)
            if current_score > highest_jw:
                highest_jw = current_score
                best_match = current_string

        return best_match

    df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
    df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

    # Map df2's index onto the closest-matching labels in df1's index, then join
    df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

    df1.join(df2)
    

    Output:

        number  letter
    one     1   a
    two     2   b
    three   3   c
    four    4   d
    five    5   e
    
  • 2020-11-22 01:45

    http://pandas.pydata.org/pandas-docs/dev/merging.html does not have a hook function to do this on the fly. Would be nice though...

    I would just do it as a separate step: use difflib's get_close_matches to create a new column in one of the two dataframes, and then merge/join on the fuzzy-matched column. A minimal sketch of that idea is below.
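
    A rough sketch of that approach, assuming two hypothetical dataframes that share a roughly-matching name column (the data and column names are placeholders, not from the original answer):

    import difflib
    import pandas as pd

    # Placeholder data: the 'name' values in df2 are misspelled variants of df1's
    df1 = pd.DataFrame({'name': ['one', 'two', 'three'], 'number': [1, 2, 3]})
    df2 = pd.DataFrame({'name': ['one', 'too', 'tree'], 'letter': ['a', 'b', 'c']})

    # Replace each name in df2 with its closest match from df1; get_close_matches
    # can return an empty list, so fall back to the original value.
    df2['name'] = df2['name'].map(
        lambda x: next(iter(difflib.get_close_matches(x, df1['name'])), x)
    )

    merged = df1.merge(df2, on='name')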

  • 2020-11-22 01:46

    I used the fuzzymatcher package and it worked well for me. See the package documentation for more details.

    Use the command below to install it:

    pip install fuzzymatcher
    

    Below is sample code (already posted by RobinL above):

    import fuzzymatcher
    
    # Columns to match on from df_left
    left_on = ["fname", "mname", "lname",  "dob"]
    
    # Columns to match on from df_right
    right_on = ["name", "middlename", "surname", "date"]
    
    # The link table potentially contains several matches for each record
    fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
    

    Errors you may get:

    1. ZeroDivisionError: float division by zero -> refer to this link to resolve it
    2. OperationalError: no such module: fts4 -> download the sqlite3.dll and replace the DLL file in your Python or Anaconda DLLs folder

    Pros:

    1. Works fast. In my case, I compared one dataframe with 3,000 rows against another dataframe with 170,000 records. It also uses SQLite3 text search, so it is faster than many alternatives.
    2. Can check across multiple columns and two dataframes. In my case, I was looking for the closest match based on address and company name. Sometimes the company name is the same, so the address is a good thing to check too.
    3. Gives you a score for all the closest matches for the same record; you choose the cutoff score (see the sketch after the cons below).

    Cons:

    1. The original package installation is buggy.
    2. Requires C++ and Visual Studio to be installed as well.
    3. Won't work with 64-bit Anaconda/Python.
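
    A minimal sketch of applying a score cutoff to the link table, assuming the result exposes a match_score column as described in the fuzzymatcher docs (the column name and the 0.5 threshold are assumptions, not from the original answer):

    import fuzzymatcher

    # Build the link table (several candidate matches per record), then keep
    # only candidates above a chosen score cutoff.
    links = fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
    good_links = links[links["match_score"] > 0.5]  # assumed column name, arbitrary cutoff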
  • 2020-11-22 01:47

    Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then join:

    In [23]: import difflib 
    
    In [24]: difflib.get_close_matches
    Out[24]: <function difflib.get_close_matches>
    
    In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
    
    In [26]: df2
    Out[26]: 
          letter
    one        a
    two        b
    three      c
    four       d
    five       e
    
    In [31]: df1.join(df2)
    Out[31]: 
           number letter
    one         1      a
    two         2      b
    three       3      c
    four        4      d
    five        5      e
    


    If these were columns, you could in the same vein apply get_close_matches to the column and then merge:

    from pandas import DataFrame

    df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
    df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

    # Replace each name in df2 with its closest match in df1['name'], then merge on it
    df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
    df1.merge(df2)
    
  • 2020-11-22 01:50

    For more complex use cases, such as matching rows on many columns, you can use the recordlinkage package. recordlinkage provides all the tools to fuzzy-match rows between pandas DataFrames, which helps to deduplicate your data when merging. I have written a detailed article about the package here. A rough sketch of the typical workflow is below.
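
    A rough sketch of the usual recordlinkage workflow (the column names, the blocking key, and the 0.85 threshold are placeholders, not from the original answer):

    import recordlinkage

    # Generate candidate pairs; blocking on an exact column keeps the number
    # of comparisons manageable (indexer.full() would compare every pair of rows).
    indexer = recordlinkage.Index()
    indexer.block('surname')  # placeholder blocking column
    candidate_pairs = indexer.index(df_left, df_right)

    # Score each candidate pair; string() applies fuzzy string comparison.
    compare = recordlinkage.Compare()
    compare.string('name', 'name', method='jarowinkler', threshold=0.85, label='name')
    compare.exact('dob', 'dob', label='dob')
    features = compare.compute(candidate_pairs, df_left, df_right)

    # Keep pairs where both comparisons agree (the cutoff is a judgment call).
    matches = features[features.sum(axis=1) >= 2]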
