Pandas merge without duplicating columns

Question

I need to merge two dataframes without creating duplicate columns. The first dataframe (dfa) has missing values. The second dataframe (dfb) has unique values. This would be the same as a VLOOKUP in Excel.

dfa looks like this:

postcode  lat  lon ...plus 32 more columns
M20       2.3  0.2
LS1       NaN  NaN
LS1       NaN  NaN
LS2       NaN  NaN
M21       2.4  0.3

dfb contains one row per postcode, and only for the postcodes whose lat and lon are NaN in dfa. It looks like this:

postcode  lat  lon 
LS1       1.4  0.1
LS2       1.5  0.2

The output I would like is:

postcode  lat  lon ...plus 32 more columns
M20       2.3  0.2
LS1       1.4  0.1
LS1       1.4  0.1
LS2       1.5  0.2
M21       2.4  0.3

I've tried using pd.merge like so:

outputdf = pd.merge(dfa, dfb, on='postcode', how='left')

This results in duplicate columns being created:

postcode  lat_x  lon_x  lat_y  lon_y ...plus 32 more columns
M20       2.3    0.2    NaN    NaN
LS1       NaN    NaN    1.4    0.1
LS1       NaN    NaN    1.4    0.1
LS2       NaN    NaN    1.5    0.2
M21       2.4    0.3    NaN    NaN

From this answer I tried using:

output = dfa
for df in [dfa, dfb]:
    output.update(df.set_index('postcode'))

But I received "ValueError: cannot reindex from a duplicate axis".

Also from the above answer, this does not work:

output.merge(pd.concat([dfa, dfb]), how='left')

There are no duplicate columns, but the values in 'lat' and 'lon' are still blank.

Is there a way to merge on 'postcode' without duplicate columns being created, effectively performing a VLOOKUP using pandas?

Answer 1:

Use DataFrame.combine_first with postcode set as the index in both DataFrames. Then, if necessary, add DataFrame.reindex to restore the column order of the original df1:

print(df1)
  postcode  lat  lon  plus  32  more  columns
0      M20  2.3  0.2   NaN NaN   NaN      NaN
1      LS1  NaN  NaN   NaN NaN   NaN      NaN
2      LS1  NaN  NaN   NaN NaN   NaN      NaN
3      LS2  NaN  NaN   NaN NaN   NaN      NaN
4      M21  2.4  0.3   NaN NaN   NaN      NaN

df1 = df1.set_index('postcode')
df2 = df2.set_index('postcode')

df3 = df1.combine_first(df2).reindex(df1.columns, axis=1)
print(df3)
          lat  lon  plus  32  more  columns
postcode                                   
LS1       1.4  0.1   NaN NaN   NaN      NaN
LS1       1.4  0.1   NaN NaN   NaN      NaN
LS2       1.5  0.2   NaN NaN   NaN      NaN
M20       2.3  0.2   NaN NaN   NaN      NaN
M21       2.4  0.3   NaN NaN   NaN      NaN
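
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same approach. The sample frames are rebuilt from the question, and the "city" column is a hypothetical stand-in for the 32 extra columns. Note that the row order of the result is not guaranteed to match df1 (it is sorted by postcode in the output above), so do not rely on it.

```python
import numpy as np
import pandas as pd

# Sample data modelled on the question. "city" is a hypothetical extra column
# standing in for the "32 more columns".
df1 = pd.DataFrame({
    "postcode": ["M20", "LS1", "LS1", "LS2", "M21"],
    "lat": [2.3, np.nan, np.nan, np.nan, 2.4],
    "lon": [0.2, np.nan, np.nan, np.nan, 0.3],
    "city": ["Manchester", "Leeds", "Leeds", "Leeds", "Manchester"],
})
df2 = pd.DataFrame({
    "postcode": ["LS1", "LS2"],
    "lat": [1.4, 1.5],
    "lon": [0.1, 0.2],
})

left = df1.set_index("postcode")
right = df2.set_index("postcode")

# Values already present in `left` are kept; NaN cells are filled from
# `right`, matched on the postcode index label.
# reindex(..., axis=1) puts the columns back in df1's order, because
# combine_first may reorder them.
df3 = left.combine_first(right).reindex(left.columns, axis=1)
print(df3)
```

No lat_x / lat_y style columns appear because nothing is joined side by side: the two frames are aligned on index and column labels and coalesced cell by cell.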

Answer 2:

DataFrame.combine_first(self, other) seems to be the best solution.

If you want one line of code and don't want to change the input dataframes:

df1.set_index('postcode').combine_first(df2.set_index('postcode'))

and if you need to keep the index from df1:

df1.reset_index().set_index('postcode').combine_first(df2.set_index('postcode')).reset_index().set_index('index').sort_index()
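
The chained call above is easier to follow when split into steps. This is only a sketch on sample data rebuilt from the question. It assumes df1 has an unnamed default index, so that reset_index() stores it in a column called "index".

```python
import numpy as np
import pandas as pd

# Sample data modelled on the question (extra columns omitted).
df1 = pd.DataFrame({
    "postcode": ["M20", "LS1", "LS1", "LS2", "M21"],
    "lat": [2.3, np.nan, np.nan, np.nan, 2.4],
    "lon": [0.2, np.nan, np.nan, np.nan, 0.3],
})
df2 = pd.DataFrame({
    "postcode": ["LS1", "LS2"],
    "lat": [1.4, 1.5],
    "lon": [0.1, 0.2],
})

# 1. Save df1's original index in a column named "index", then key on postcode.
left = df1.reset_index().set_index("postcode")

# 2. Fill the NaN cells of `left` from df2, matched on postcode.
filled = left.combine_first(df2.set_index("postcode"))

# 3. Move postcode back to a column, restore the saved index, and sort by it
#    to recover df1's original row order.
result = filled.reset_index().set_index("index").sort_index()
print(result)
```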

Not too elegant, but it works.
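
If the only goal is to fill the missing lat and lon values while leaving dfa's rows and index untouched, another option (not taken from the answers above, just a sketch of a different technique) is to treat dfb as a lookup table with Series.map and fill the gaps with Series.fillna. This is probably the closest analogue to a VLOOKUP. It assumes dfb has exactly one row per postcode, as the question states.

```python
import numpy as np
import pandas as pd

# Sample data modelled on the question (extra columns omitted).
dfa = pd.DataFrame({
    "postcode": ["M20", "LS1", "LS1", "LS2", "M21"],
    "lat": [2.3, np.nan, np.nan, np.nan, 2.4],
    "lon": [0.2, np.nan, np.nan, np.nan, 0.3],
})
dfb = pd.DataFrame({
    "postcode": ["LS1", "LS2"],
    "lat": [1.4, 1.5],
    "lon": [0.1, 0.2],
})

# dfb has unique postcodes, so it can act as a lookup table keyed on postcode.
lookup = dfb.set_index("postcode")

out = dfa.copy()  # leave dfa itself unchanged
for col in ["lat", "lon"]:
    # Look up each row's postcode in dfb (NaN when the postcode is absent),
    # then use the looked-up value only where dfa's own value is missing.
    out[col] = out[col].fillna(out["postcode"].map(lookup[col]))
print(out)
```

Because the frames are never joined, no suffixed columns are created, and duplicate postcodes in dfa are not a problem.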

Source: https://stackoverflow.com/questions/57408583/pandas-merge-without-duplicating-columns

Tags

python

pandas

dataframe

merge