How can I create an artificial key column for merging two datasets using difflib when the column of interest has missing cells?

Question

Goal: If the name in row i of df2 is a substring or an exact match of a name in some row N of df1, and the state and district columns of those two rows also match, combine the rows.
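Read literally, this goal is a substring test plus two equality checks. A minimal sketch in plain Python (the dict rows and the function name literal_match are hypothetical stand-ins, not part of the original code) suggests that a strict substring test misses names with an inserted middle name, which is why fuzzy matching is attempted instead:

```python
# Hypothetical stand-ins for single rows of df1 and df2.
exact_row = {'CandidateName': 'Jason Lewis', 'State': 'MN', 'District': 2}
middle_name_row = {'CandidateName': 'Jason Initial Lewis', 'State': 'MN', 'District': 2}
df2_row = {'Name': 'Jason Lewis', 'State': 'MN', 'District': 2}

def literal_match(r1, r2):
    """True when r2's name occurs inside r1's name and state/district agree."""
    return (
        r2['Name'] in r1['CandidateName']
        and r1['State'] == r2['State']
        and r1['District'] == r2['District']
    )

exact_result = literal_match(exact_row, df2_row)         # identical names
middle_result = literal_match(middle_name_row, df2_row)  # 'Initial' breaks the substring
print(exact_result, middle_result)
```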

It was recommended that I use difflib to create an artificial key column to merge on.

This new column is called 'Name'. difflib.get_close_matches looks for similar strings in df2.

This works well when all rows in the 'CandidateName' column are present, but I get IndexError: list index out of range when a cell is missing.

I tried resolving this issue by filling in the empty cells with the string 'EMPTY'. However, the same error still occurs.

# I used this method to replace empty cells
df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY')


# I then proceeded to run the line again
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])

# Data Frame Samples

# Data Frame 1
import pandas as pd
CandidateName = ['Theodorick A. Bland','Aedanus Rutherford Burke','Jason Lewis','Barbara  Comstock','Theodorick Bland','Aedanus Burke','Jason Initial Lewis', '','']
State = ['VA', 'SC', 'MN','VA','VA', 'SC', 'MN','NH','NH']
District = [9,2,2,10,9,2,2,1,1]
Party = ['','', '','Democrat','','','Democrat','Whig','Whig']
data1 = {'CandidateName':CandidateName, 'State':State, 'District':District,'Party':Party }
df1 = pd.DataFrame(data = data1)

print(df1)

#        CandidateName         District   Party          State
#0  Theodorick A. Bland           9                       VA
#1  Aedanus Rutherford Burke      2                       SC
#2  Jason Lewis                   2                       MN
#3  Barbara Comstock             10       Democrat        VA
#4  Theodorick Bland              9                       VA
#5  Aedanus Burke                 2                       SC
#6  Jason Initial Lewis           2         Democrat      MN
#7  ''                            1         Whig          NH
#8  ''                            1         Whig          NH

# Data Frame 2
Name = ['Theodorick Bland','Aedanus Burke','Jason Lewis', 'Barbara Comstock']
State = ['VA', 'SC', 'MN','VA']
District = [9,2,2,10]
Party = ['','', 'Democrat','Democrat']
data2 = {'Name':Name, 'State':State, 'District':District, 'Party':Party}
df2 = pd.DataFrame(data = data2)

print(df2)

#   Name                 District   Party      State
#0  Theodorick Bland        9                   VA
#1  Aedanus Burke           2                   SC
#2  Jason Lewis             2       Democrat    MN
#3  Barbara Comstock        10      Democrat    VA

import difflib
df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])

df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'])

Expected output:

print(df1)
#              CandidateName State  District     Party              Name
#0       Theodorick A. Bland    VA         9            Theodorick Bland
#1  Aedanus Rutherford Burke    SC         2               Aedanus Burke
#2               Jason Lewis    MN         2                 Jason Lewis
#3         Barbara  Comstock    VA        10  Democrat  Barbara Comstock
#4          Theodorick Bland    VA         9            Theodorick Bland
#5             Aedanus Burke    SC         2               Aedanus Burke
#6       Jason Initial Lewis    MN         2  Democrat       Jason Lewis
#7                              NH         1      Whig    
#8                              NH         1      Whig

Actual Error Result:

-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
---> 23 df1['Name'] = df1['CandidateName'].apply(lambda x: difflib.get_close_matches(x, df2['Name'])[0])

IndexError: list index out of range

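Answer 1:

difflib.get_close_matches returns a list. When no name in df2 is similar enough, as with an empty cell (or the placeholder 'EMPTY'), that list is empty, so indexing it with [0] raises the IndexError. Instead of indexing, join the list into a string: an empty list then becomes an empty string, and the resulting column can be used as a merge key, as in the code below.

Note: you don't need df1['CandidateName'] = df1['CandidateName'].replace('', 'EMPTY') for this.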
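To see why the 'EMPTY' placeholder does not help, the lookup can be tried on its own. This is a minimal sketch using only difflib and the four names from df2; the variable names are illustrative.

```python
import difflib

names = ['Theodorick Bland', 'Aedanus Burke', 'Jason Lewis', 'Barbara Comstock']

# A close spelling returns a one-element list.
hit = difflib.get_close_matches('Theodorick A. Bland', names)

# An empty string and the placeholder 'EMPTY' both fall below the
# default similarity cutoff (0.6), so the result is an empty list.
blank = difflib.get_close_matches('', names)
placeholder = difflib.get_close_matches('EMPTY', names)

print(hit, blank, placeholder)

# Indexing an empty list is what raises the IndexError.
try:
    blank[0]
except IndexError as err:
    print('IndexError:', err)
```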

import difflib
# n=1 keeps only the best match, so join never glues several names together
df1['Name'] = df1['CandidateName'].apply(lambda x: ''.join(difflib.get_close_matches(x, df2['Name'], n=1)))

df_merge = df1.merge(df2.drop('Party', axis=1), on=['Name', 'State', 'District'], how='left')

print(df_merge)
              CandidateName State  District     Party              Name
0       Theodorick A. Bland    VA         9            Theodorick Bland
1  Aedanus Rutherford Burke    SC         2               Aedanus Burke
2               Jason Lewis    MN         2                 Jason Lewis
3         Barbara  Comstock    VA        10  Democrat  Barbara Comstock
4          Theodorick Bland    VA         9            Theodorick Bland
5             Aedanus Burke    SC         2               Aedanus Burke
6       Jason Initial Lewis    MN         2  Democrat       Jason Lewis
7                              NH         1      Whig                  
8                              NH         1      Whig

Note that I added how='left' to the merge, since you want to keep every row of your original dataframe, including the rows with no match.
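The difference between the default inner merge and how='left' can be checked on a two-row example. This is a sketch with made-up miniature frames (left and right are illustrative names); indicator=True is a standard merge option that adds a _merge column showing which rows found a partner.

```python
import pandas as pd

keys = ['Name', 'State', 'District']

# Two rows on the left: one with a usable key, one with a blank name.
left = pd.DataFrame({'Name': ['Jason Lewis', ''],
                     'State': ['MN', 'NH'],
                     'District': [2, 1]})
right = pd.DataFrame({'Name': ['Jason Lewis'],
                      'State': ['MN'],
                      'District': [2]})

# Default how='inner' drops the row that has no partner in `right`.
inner = left.merge(right, on=keys)

# how='left' keeps every row of `left`; _merge marks unmatched rows.
kept = left.merge(right, on=keys, how='left', indicator=True)

print(len(inner), len(kept))
print(kept['_merge'].tolist())
```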

Explanation of ''.join()
We do this to convert the list to a string. The string before .join is the separator placed between the items. For example:

lst = ['hello', 'world']

print(' '.join(lst))
# hello world
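Joining an empty list returns an empty string, which is why blank names no longer raise an error. Keep in mind that get_close_matches returns up to three matches unless n=1 is passed, and joining several matches would produce a key that matches nothing. A hedged sketch of a small helper (best_match is a hypothetical name, not part of difflib) that returns only the top match or an empty string:

```python
import difflib

def best_match(name, candidates, cutoff=0.6):
    """Return the closest candidate, or '' when nothing clears the cutoff."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else ''

candidates = ['Aedanus Burke', 'Jason Lewis']

joined_empty = ''.join([])                       # empty list -> empty string
found = best_match('Aedanus Rutherford Burke', candidates)
missing = best_match('', candidates)             # blank input -> ''

print(repr(joined_empty), repr(found), repr(missing))
```

It could be used in place of the lambda, for example df1['CandidateName'].apply(lambda x: best_match(x, df2['Name'])).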

Source: https://stackoverflow.com/questions/55445922/how-can-i-create-an-artificial-key-column-for-merging-two-datasets-using-difflab

Tags: python, regex, pandas, python-2.7, difflib