问题
I have 2 dataframes currently, 1 for donors and 1 for fundraisers. Ideally what I'm trying to find is if any fundraisers also gave donations and if so copy some of that information into my fundraiser data set (donor name, email and their first donation). Problems with my data are 1) I need to match by name and email, but a user might have slightly different names (ex Kat and Kathy). 2) Duplicate names for donors and fundraisers. 2a) With donors I can get unique name/email combinations since I just care about the first donation date 2b) With fundraisers though I need to keep both rows and not lose data like the date.
Sample code I have right now:
import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib
donors = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Tom Smith","Jane Doe","Jane Doe","Kat test"]), "Email": pd.Series(['a@a.ca','a@a.ca','b@b.ca','c@c.ca','something@a.ca','d@d.ca']),"Date": (["27/03/2013 10:00:00 AM","1/03/2013 10:39:00 AM","2/03/2013 10:39:00 AM","3/03/2013 10:39:00 AM","4/03/2013 10:39:00 AM","27/03/2013 10:39:00 AM"])})
fundraisers = pd.DataFrame({"name": pd.Series(["John Doe","John Doe","Kathy test","Tes Ester", "Jane Doe"]),"Email": pd.Series(['a@a.ca','a@a.ca','d@d.ca','asdf@asdf.ca','something@a.ca']),"Date": pd.Series(["2/03/2013 10:39:00 AM","27/03/2013 11:39:00 AM","3/03/2013 10:39:00 AM","4/03/2013 10:40:00 AM","27/03/2013 10:39:00 AM"])})
donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
fundraisers["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
donors["code"] = donors.apply(lambda row: str(row['name'])+' '+str(row['Email']), axis=1)
idx = donors.groupby('code')["Date"].transform(min) == donors['Date']
donors = donors[idx].reset_index().drop('index',1)
So this leaves me with the first donation by each donor (assuming anyone with the exact same name and email is the same person).
ideally I want my fundraiser dataset to look like:
Date Email name Donor Name Donor Email Donor Date
2013-03-27 10:00:00 a@a.ca John Doe John Doe a@a.ca 2013-03-27 10:00:00
2013-01-03 10:39:00 a@a.ca John Doe John Doe a@a.ca 2013-03-27 10:00:00
2013-02-03 10:39:00 d@d.ca Kathy test Kat test d@d.ca 2013-03-27 10:39:00
2013-03-03 10:39:00 asdf@asdf.ca Tes Ester
2013-04-03 10:39:00 something@a.ca Jane Doe Jane Doe something@a.ca 2013-04-03 10:39:00
I tried following this thread: is it possible to do fuzzy match merge with python pandas? but keep getting index out of range errors (guessing it doesn't like the duplicated names in fundraisers) :( So any ideas how I can match/merge these datasets?
doing it with for loops (which works but is super slow and I feel there has to be a better way)
fundraisers["donor name"] = ""
fundraisers["donor email"] = ""
fundraisers["donor date"] = ""
for donindex in range(len(donors.index)):
max = 75
for funindex in range(len(fundraisers.index)):
aname = donors["name"][donindex]
comp = fundraisers["name"][funindex]
ratio = fuzz.ratio(aname, comp)
if ratio > max:
if (donors["Email"][donindex] == fundraisers["Email"][funindex]):
ratio *= 2
max = ratio
fundraisers["donor name"][funindex] = aname
fundraisers["donor email"][funindex] = donors["Email"][donindex]
fundraisers["donor date"][funindex] = donors["Date"][donindex]
回答1:
Here's a bit more pythonic (in my view), working (on your example) code, without explicit loops:
def get_donors(row):
d = donors.apply(lambda x: fuzz.ratio(x['name'], row['name']) * 2 if row['Email'] == x['Email'] else 1, axis=1)
d = d[d >= 75]
if len(d) == 0:
v = ['']*3
else:
v = donors.ix[d.idxmax(), ['name','Email','Date']].values
return pd.Series(v, index=['donor name', 'donor email', 'donor date'])
pd.concat((fundraisers, fundraisers.apply(get_donors, axis=1)), axis=1)
Output:
Date Email name donor name donor email donor date
0 2013-03-27 10:00:00 a@a.ca John Doe John Doe a@a.ca 2013-03-01 10:39:00
1 2013-03-01 10:39:00 a@a.ca John Doe John Doe a@a.ca 2013-03-01 10:39:00
2 2013-03-02 10:39:00 d@d.ca Kathy test Kat test d@d.ca 2013-03-27 10:39:00
3 2013-03-03 10:39:00 asdf@asdf.ca Tes Ester
4 2013-03-04 10:39:00 something@a.ca Jane Doe Jane Doe something@a.ca 2013-03-04 10:39:00
回答2:
I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].
This is how I would do it with Jaro-Winkler from the jellyfish package:
def get_closest_match(x, list_strings):
best_match = None
highest_jw = 0
for current_string in list_strings:
current_score = jellyfish.jaro_winkler(x, current_string)
if(current_score > highest_jw):
highest_jw = current_score
best_match = current_string
return best_match
df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))
df1.join(df2)
Output:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
Update: Use jaro_winkler from the Levenshtein module for improved performance.
from jellyfish import jaro_winkler as jf_jw
from Levenshtein import jaro_winkler as lv_jw
%timeit jf_jw("appel", "apple")
>> 339 ns ± 1.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit lv_jw("appel", "apple")
>> 193 ns ± 0.675 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
回答3:
How to identify Fuzzy duplication in DataFrame using Pandas
This my data frame
def get_ratio(row):
name = row['Name_1']
return fuzz.token_sort_ratio(name,"Ceylon Hotels Corporation")
df[df.apply(get_ratio, axis=1) > 70]
来源:https://stackoverflow.com/questions/19964546/python-pandas-fuzzy-merge-match-with-duplicates