I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I woul
I have written a Python package which aims to solve this problem:
pip install fuzzymatcher
You can find the repo here and docs here.
Basic usage:
Given two dataframes df_left
and df_right
, which you want to fuzzy join, you can write the following:
from fuzzymatcher import link_table, fuzzy_left_join
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
Or if you just want to link on the closest match:
fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)
I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].
This is how I would do it with Jaro-Winkler from the jellyfish package:
def get_closest_match(x, list_strings):
best_match = None
highest_jw = 0
for current_string in list_strings:
current_score = jellyfish.jaro_winkler(x, current_string)
if(current_score > highest_jw):
highest_jw = current_score
best_match = current_string
return best_match
df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))
df1.join(df2)
Output:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
http://pandas.pydata.org/pandas-docs/dev/merging.html does not have a hook function to do this on the fly. Would be nice though...
I would just do a separate step and use difflib getclosest_matches to create a new column in one of the 2 dataframes and the merge/join on the fuzzy matched column
I used Fuzzymatcher package and this worked well for me. Visit this link for more details on this.
use the below command to install
pip install fuzzymatcher
Below is the sample Code (already submitted by RobinL above)
from fuzzymatcher import link_table, fuzzy_left_join
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
Errors you may get
Pros :
cons:
Similar to @locojay suggestion, you can apply difflib's get_close_matches to df2
's index and then apply a join:
In [23]: import difflib
In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>
In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
In [26]: df2
Out[26]:
letter
one a
two b
three c
four d
five e
In [31]: df1.join(df2)
Out[31]:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
.
If these were columns, in the same vein you could apply to the column then merge:
df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
For more complex use cases to match rows with many columns you can use recordlinkage
package. recordlinkage
provides all the tools to fuzzy match rows between pandas
data frames which helps to deduplicate your data when merging. I have written a detailed article about the package here