Question
I need to merge two dataframes without creating duplicate columns. The first dataframe (dfa) has missing values. The second dataframe (dfb) has unique values. This would be the same as a VLOOKUP in Excel.
dfa looks like this:
postcode lat lon ...plus 32 more columns
M20 2.3 0.2
LS1 NaN NaN
LS1 NaN NaN
LS2 NaN NaN
M21 2.4 0.3
dfb only contains unique Postcodes and values where lat and lon were NaN in dfa. It looks like this:
postcode lat lon
LS1 1.4 0.1
LS2 1.5 0.2
The output I would like is:
postcode lat lon ...plus 32 more columns
M20 2.3 0.2
LS1 1.4 0.1
LS1 1.4 0.1
LS2 1.5 0.2
M21 2.4 0.3
I've tried using pd.merge like so:
outputdf = pd.merge(dfa, dfb, on='Postcode', how='left')
This results in duplicate columns being created:
postcode lat_x lon_x lat_y lon_y ...plus 32 more columns
M20 2.3 0.2 NaN NaN
LS1 NaN NaN 1.4 0.1
LS1 NaN NaN 1.4 0.1
LS2 NaN NaN 1.5 0.2
M21 2.4 0.3 NaN NaN
From this answer I tried using:
output = dfa
for df in [dfa, dfb]:
    output.update(df.set_index('Postcode'))
But this raised "ValueError: cannot reindex from a duplicate axis".
Also from the above answer this does not work:
output.merge(pd.concat([dfa, dfb]), how='left')
There are no duplicate columns but the values in 'Lat' and 'Lon' are still blank.
Is there a way to merge on 'Postcode' without duplicate columns being created; effectively performing a VLOOKUP using pandas?
Answer 1:
Use DataFrame.combine_first with postcode set as the index in both DataFrames (here df1 is dfa and df2 is dfb), then, if necessary, add DataFrame.reindex to restore the column order of the original df1:
print (df1)
postcode lat lon plus 32 more columns
0 M20 2.3 0.2 NaN NaN NaN NaN
1 LS1 NaN NaN NaN NaN NaN NaN
2 LS1 NaN NaN NaN NaN NaN NaN
3 LS2 NaN NaN NaN NaN NaN NaN
4 M21 2.4 0.3 NaN NaN NaN NaN
df1 = df1.set_index('postcode')
df2 = df2.set_index('postcode')
df3 = df1.combine_first(df2).reindex(df1.columns, axis=1)
print (df3)
lat lon plus 32 more columns
postcode
LS1 1.4 0.1 NaN NaN NaN NaN
LS1 1.4 0.1 NaN NaN NaN NaN
LS2 1.5 0.2 NaN NaN NaN NaN
M20 2.3 0.2 NaN NaN NaN NaN
M21 2.4 0.3 NaN NaN NaN NaN
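The approach above can be tried end to end with a small reconstruction of the question's data. This is only a sketch: the frames dfa and dfb are rebuilt by hand from the tables in the question, and the single `other` column is a made-up stand-in for the 32 extra columns.

```python
import numpy as np
import pandas as pd

# Hand-built sample data mirroring the question.
# "other" is a hypothetical placeholder for the 32 extra columns.
dfa = pd.DataFrame({
    "postcode": ["M20", "LS1", "LS1", "LS2", "M21"],
    "lat": [2.3, np.nan, np.nan, np.nan, 2.4],
    "lon": [0.2, np.nan, np.nan, np.nan, 0.3],
    "other": ["a", "b", "c", "d", "e"],
})
dfb = pd.DataFrame({
    "postcode": ["LS1", "LS2"],
    "lat": [1.4, 1.5],
    "lon": [0.1, 0.2],
})

# Index both frames by the lookup key.
left = dfa.set_index("postcode")
right = dfb.set_index("postcode")

# Fill the NaNs in left from right, then restore left's column order.
filled = left.combine_first(right).reindex(left.columns, axis=1)
print(filled)
```

As in the printed df3, the rows come back ordered by postcode rather than in their original order. The lookup frame should have unique postcodes, otherwise the aligned result may contain extra rows.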
Answer 2:
DataFrame.combine_first(self, other) seems to be the best solution.
If you want one line of code and don't want to change input dataframes:
df1.set_index('postcode').combine_first(df2.set_index('postcode'))
and if you need to keep the index from df1:
df1.reset_index().set_index('postcode').combine_first(df2.set_index('postcode')).reset_index().set_index('index').sort_index()
Not too elegant, but it works.
Source: https://stackoverflow.com/questions/57408583/pandas-merge-without-duplicating-columns
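If keeping the original row order and index is the main concern, a per-column lookup may read more simply than the chained call. Note that this is a different technique from the two answers (Series.map plus Series.fillna rather than combine_first), sketched on hand-built sample data; the `other` column is again a hypothetical stand-in for the 32 extra columns, and the lookup frame is assumed to have unique postcodes.

```python
import numpy as np
import pandas as pd

# Hand-built sample data mirroring the question.
dfa = pd.DataFrame({
    "postcode": ["M20", "LS1", "LS1", "LS2", "M21"],
    "lat": [2.3, np.nan, np.nan, np.nan, 2.4],
    "lon": [0.2, np.nan, np.nan, np.nan, 0.3],
    "other": ["a", "b", "c", "d", "e"],  # hypothetical extra column
})
dfb = pd.DataFrame({
    "postcode": ["LS1", "LS2"],
    "lat": [1.4, 1.5],
    "lon": [0.1, 0.2],
})

# Lookup table keyed by postcode; this assumes postcodes in dfb are unique.
lookup = dfb.set_index("postcode")

out = dfa.copy()  # leave the input frame untouched
for col in ["lat", "lon"]:
    # map() looks up each row's postcode in the lookup table;
    # fillna() then only fills positions that are currently NaN.
    out[col] = out[col].fillna(out["postcode"].map(lookup[col]))

print(out)
```

Rows whose postcode is missing from dfb simply keep their NaN, which is roughly what an Excel VLOOKUP with no match amounts to.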