multiple merge operations on two dataframes using pandas

Posted by 倖福魔咒の on 2021-02-11 13:59:43

Question


I have two dataframes where multiple operations are to be implemented, for example:

old_DF

id   col1   col2    col3
-------------------------
1    aaa        
2           bbb     123

new_DF

id   col1   col2    col3
-------------------------
1           xxx      999
2    xxx    kkk 

The following operations need to be performed on these dataframes:

  1. Merging the two dataframes
  2. Replacing only the blank (NA) cells in old_DF with the corresponding values from new_DF
  3. Reporting, in a new dataframe, the cells where both dataframes have a value but the values conflict

Desired results:

updated_df

id   col1   col2    col3
-------------------------
1    aaa    xxx     999
2    xxx    bbb     123

conflicts_df

id   col1   col2    col3
-------------------------
2           bbb
2           kkk     

I can use the .append() method to join the two dataframes, and I guess one can use the .bfill() or .ffill() methods to fill in the missing values, but I have been unsuccessful with both. I have tried df.groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates(), but I do not get the desired results. Additionally, I do not understand how to perform step 3 above. Can anyone help with this problem?


Answer 1:


Setting up:

import pandas as pd

old_df = pd.DataFrame([
  [1, 'aaa', pd.NA, pd.NA],
  [2, pd.NA, 'bbb', 123]],
  columns=['id', 'col1', 'col2', 'col3'])
new_df = pd.DataFrame([
  [1, pd.NA, 'xxx', 999],
  [2, 'xxx', 'kkk', pd.NA]],
  columns=['id', 'col1', 'col2', 'col3'])

Set id as the index on both frames, then use combine_first to get updated_df. Values from old_df are kept, and only its missing cells are filled from new_df:

old_df = old_df.set_index('id')
new_df = new_df.set_index('id')
updated_df = old_df.combine_first(new_df)

# updated_df outputs
# (call reset_index() if id is needed as a regular column again):
   col1 col2 col3
id               
1   aaa  xxx  999
2   xxx  bbb  123
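
As a self-contained check of the step above, the following sketch rebuilds the two example frames and confirms that combine_first keeps every existing value from old_df and fills only its gaps from new_df. It is a minimal illustration, not the answer's exact code: it uses None instead of pd.NA for the missing cells (an assumption made for simplicity), and the name flat is hypothetical.

```python
import pandas as pd

# Example frames from the question; None marks a missing cell (assumption:
# the answer itself uses pd.NA, which behaves the same way for this step).
old_df = pd.DataFrame(
    {"id": [1, 2], "col1": ["aaa", None], "col2": [None, "bbb"], "col3": [None, 123]}
).set_index("id")
new_df = pd.DataFrame(
    {"id": [1, 2], "col1": [None, "xxx"], "col2": ["xxx", "kkk"], "col3": [999, None]}
).set_index("id")

# Keep old_df's values; fill only its missing cells from new_df.
updated_df = old_df.combine_first(new_df)

# Turn the id index back into an ordinary column ("flat" is a hypothetical name).
flat = updated_df.reset_index()
print(flat)
```

Note that the conflicting cell (id 2, col2) keeps the old value bbb, because combine_first never overwrites a value that is already present in the calling frame.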

Next, build a boolean mask that is True only where both the old and new frames have a value in a given cell and those values differ. Then use the mask to pick the conflicting cells from both frames, concatenate them, and drop the rows that contain no conflict:

mask = pd.notnull(new_df) & ~old_df.eq(new_df) & pd.notnull(old_df)
conflicts_df = pd.concat([old_df[mask], new_df[mask]]).dropna(how='all')

# conflicts_df outputs
   col1 col2 col3
id               
2   NaN  bbb  NaN
2   NaN  kkk  NaN
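
If a side-by-side view is easier to read, one possible variant lists each conflicting cell as a single row with its old and new value. This is a sketch under stated assumptions rather than part of the original answer: the report frame and its column names (column, old, new) are hypothetical, and None is used instead of pd.NA for the missing cells.

```python
import pandas as pd

# Example frames from the question; None marks a missing cell (assumption).
old_df = pd.DataFrame(
    {"id": [1, 2], "col1": ["aaa", None], "col2": [None, "bbb"], "col3": [None, 123]}
).set_index("id")
new_df = pd.DataFrame(
    {"id": [1, 2], "col1": [None, "xxx"], "col2": ["xxx", "kkk"], "col3": [999, None]}
).set_index("id")

# True only where both frames hold a value and the two values differ.
mask = old_df.notna() & new_df.notna() & old_df.ne(new_df)

# One row per conflicting cell; "report" and its column names are hypothetical.
rows = [
    {"id": idx, "column": col, "old": old_df.at[idx, col], "new": new_df.at[idx, col]}
    for idx in mask.index
    for col in mask.columns
    if mask.at[idx, col]
]
report = pd.DataFrame(rows, columns=["id", "column", "old", "new"])
print(report)
```

For the example data, the only conflict is col2 for id 2, where old_df has bbb and new_df has kkk.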


Source: https://stackoverflow.com/questions/61617200/multiple-merge-operations-on-two-dataframes-using-pandas
