Question
Goal: if the name in row i of df2 is a substring or an exact match of a name in some row N of df1, and the state and district columns of that df1 row match the state and district columns of df2 row i, combine the two rows.
I was advised to use difflib to create an artificial key column to merge on.
This new column is called 'Name'. difflib.get_close_matches looks for similar strings in df2['Name'].
This works well when every row in the 'CandidateName' column has a value, but I get IndexError: list index out of range when a cell is empty.
I tried to resolve this by filling the empty cells with the string 'EMPTY'. However, the same error still occurs.
# I used this method to replace empty cells
df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')
# I then proceeded to run the line again
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
# Data Frame Samples
import pandas as pd
# Data Frame 1
CandidateName = ['Theodorick A. Bland','Aedanus Rutherford Burke','Jason Lewis','Barbara Comstock','Theodorick Bland','Aedanus Burke','Jason Initial Lewis', '','']
State = ['VA', 'SC', 'MN','VA','VA', 'SC', 'MN','NH','NH']
District = [9,2,2,10,9,2,2,1,1]
Party = ['','', '','Democrat','','','Democrat','Whig','Whig']
data1 = {'CandidateName':CandidateName, 'State':State, 'District':District,'Party':Party }
df1 = pd.DataFrame(data = data1)
print(df1)
#              CandidateName  District     Party  State
#0       Theodorick A. Bland         9               VA
#1  Aedanus Rutherford Burke         2               SC
#2               Jason Lewis         2               MN
#3          Barbara Comstock        10  Democrat     VA
#4          Theodorick Bland         9               VA
#5             Aedanus Burke         2               SC
#6       Jason Initial Lewis         2  Democrat     MN
#7                                   1      Whig     NH
#8                                   1      Whig     NH
# Data Frame 2
Name = ['Theodorick Bland','Aedanus Burke','Jason Lewis', 'Barbara Comstock']
State = ['VA', 'SC', 'MN','VA']
District = [9,2,2,10]
Party = ['','', 'Democrat','Democrat']
data2 = {'Name':Name, 'State':State, 'District':District, 'Party':Party}
df2 = pd.DataFrame(data = data2)
print(df2)
#               Name  District     Party  State
#0  Theodorick Bland         9               VA
#1     Aedanus Burke         2               SC
#2       Jason Lewis         2  Democrat     MN
#3  Barbara Comstock        10  Democrat     VA
import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'])
Expected
print(df1)
#              CandidateName  State  District     Party              Name
#0       Theodorick A. Bland     VA         9            Theodorick Bland
#1  Aedanus Rutherford Burke     SC         2               Aedanus Burke
#2               Jason Lewis     MN         2                 Jason Lewis
#3          Barbara Comstock     VA        10  Democrat  Barbara Comstock
#4          Theodorick Bland     VA         9            Theodorick Bland
#5             Aedanus Burke     SC         2               Aedanus Burke
#6       Jason Initial Lewis     MN         2  Democrat       Jason Lewis
#7                               NH         1      Whig
#8                               NH         1      Whig
Actual Error Result:
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
---> 23 df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])
IndexError: list index out of range
Answer 1:
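difflib.get_close_matches returns a list, and when no name in df2 is similar enough, that list is empty. An empty list has no index 0, which is why you get this error. It happens for the blank names, and also after replacing them with 'EMPTY', because 'EMPTY' is not close to any name in df2 either. Second, we need to convert these lists to strings to be able to do the merge, as follows:
Note: you don't have to use df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')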
import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: ''.join(difflib.get_close_matches(x, df2['Name'])))
df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'], how='left')
print(df_merge)
              CandidateName  State  District     Party              Name
0       Theodorick A. Bland     VA         9            Theodorick Bland
1  Aedanus Rutherford Burke     SC         2               Aedanus Burke
2               Jason Lewis     MN         2                 Jason Lewis
3          Barbara Comstock     VA        10  Democrat  Barbara Comstock
4          Theodorick Bland     VA         9            Theodorick Bland
5             Aedanus Burke     SC         2               Aedanus Burke
6       Jason Initial Lewis     MN         2  Democrat       Jason Lewis
7                               NH         1      Whig
8                               NH         1      Whig
Note that I added the how='left' argument to merge, since you want to keep the shape of your original dataframe.
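To see where the IndexError comes from, it can help to call difflib.get_close_matches on its own. The sketch below assumes a plain Python list holding the four names from df2 (so it runs without pandas) and shows that both a blank name and the 'EMPTY' placeholder come back as an empty list, which is what [0] then fails on.

```python
import difflib

# Assumption: the four names from df2['Name'], held in a plain list
names = ['Theodorick Bland', 'Aedanus Burke', 'Jason Lewis', 'Barbara Comstock']

# Neither a blank string nor the placeholder is similar to any name,
# so get_close_matches returns an empty list for both
blank_matches = difflib.get_close_matches('', names)
placeholder_matches = difflib.get_close_matches('EMPTY', names)

# A real candidate name does find a close match, best one first
close_matches = difflib.get_close_matches('Theodorick A. Bland', names)

print(blank_matches)        # []
print(placeholder_matches)  # []
print(close_matches[0])     # Theodorick Bland
```

Indexing either of the first two results with [0] raises the same IndexError as in the question.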
Explanation of ''.join()
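We do this to convert the list to a string. An empty list joins to an empty string, so rows without a match no longer raise an error. Here is an example that joins with a space as the separator:
lst = ['hello', 'world']
print(' '.join(lst))
# hello world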
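One thing to watch with ''.join(): get_close_matches can return up to three matches by default, and joining them glues the names together into a key that will not merge. A possible alternative, sketched below, is to ask for a single match with n=1 and fall back to an empty string. The helper name best_match and the two-name candidate list are hypothetical and only there for illustration.

```python
import difflib

def best_match(name, candidates, cutoff=0.6):
    """Return the closest candidate, or '' when nothing clears the cutoff."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else ''

# Hypothetical candidate list containing two near-identical names
candidates = ['Jason Lewis', 'Jason Lewis Jr.']

# Both candidates are close to the query, so ''.join concatenates them
joined = ''.join(difflib.get_close_matches('Jason Lewis', candidates))
single = best_match('Jason Lewis', candidates)
missing = best_match('', candidates)

print(joined)         # Jason LewisJason Lewis Jr.
print(single)         # Jason Lewis
print(repr(missing))  # ''
```

Under these assumptions it would be applied the same way as in the answer, for example df1['Name'] = df1['CandidateName'].apply(lambda x: best_match(x, df2['Name'])).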
Source: https://stackoverflow.com/questions/55445922/how-can-i-create-an-artificial-key-column-for-merging-two-datasets-using-difflab