Merge and update dataframes based on a subset of their columns


I wonder whether there is a faster way to replace the two for loops, given that the dataframes are large. In my real case, each dataframe is 200 rows and 25 columns.
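A quick way to check which approach is actually fastest on frames of this size is timeit; in the sketch below, loop_update and vectorized_update are placeholders for the original two-loop version and any candidate replacement, and the random 200×25 frames are stand-ins as well:

    import timeit

    import numpy as np
    import pandas as pd

    # Stand-in frames roughly matching the stated size (200 rows x 25 columns).
    df1 = pd.DataFrame(np.random.rand(200, 25))
    df2 = pd.DataFrame(np.random.rand(200, 25))

    def loop_update(a, b):
        ...  # placeholder for the original two-for-loop update

    def vectorized_update(a, b):
        ...  # placeholder for a candidate vectorized replacement

    print(timeit.timeit(lambda: loop_update(df1.copy(), df2.copy()), number=100))
    print(timeit.timeit(lambda: vectorized_update(df1.copy(), df2.copy()), number=100))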

3 Answers
  • 2021-01-20 02:15

    Another method: merge, then transpose, ffill, and drop duplicates on the transposed dataframe, i.e.

    new_df = (df1.merge(df2, on=[0], how='outer').T.set_index(0).sort_index()
              .ffill().reset_index().drop_duplicates(0, keep='last').T.dropna())
    
               0     2     3     5
    0  Attribute  Date  Name  Unit
    1          1  2019     a     F
    2          2  2020     b     G
    3          3  2016     c     C
    4          4  2017     d     D
    5          5  2021     e     H
    

    Explanation

    df1.merge(df2,on=[0],how='outer').T.set_index(0).sort_index()
    

    Transposing the merged dataframe gives a frame on which ffill can fill each NaN from the value above it in the same column (a small standalone ffill sketch follows the table below):

                1     2     3     4     5     6
    0                                            
    Attribute     1     2     3     4     5   NaN
    Date       2014  2015  2016  2017  2018   NaN
    Date       2019  2020   NaN   NaN  2021  2022
    Name          a     b     c     d     e     f
    Unit          A     B     C     D     E   NaN
    Unit          F     G   NaN   NaN     H     I
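    As a small standalone sketch (values copied from the two Date rows above), ffill propagates the last non-NaN value in each column downwards:

    import numpy as np
    import pandas as pd

    # The two 'Date' rows from the transposed frame, columns 1-3 only.
    t = pd.DataFrame({1: [2014.0, 2019.0], 2: [2015.0, 2020.0], 3: [2016.0, np.nan]},
                     index=['Date', 'Date'])
    t.ffill()
    #            1       2       3
    # Date  2014.0  2015.0  2016.0
    # Date  2019.0  2020.0  2016.0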
    
    .ffill().reset_index().drop_duplicates(0,keep='last')
    

    This fills the NaN values with the data from the previous row; then reset_index followed by drop_duplicates on column 0 with keep='last' keeps only the completely filled row for each attribute (a tiny standalone keep='last' sketch follows the table below):

             0     1     2     3     4     5     6
    0  Attribute     1     2     3     4     5   NaN
    2       Date  2019  2020  2016  2017  2021  2022
    3       Name     a     b     c     d     e     f
    5       Unit     F     G     C     D     H     I
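    A tiny standalone example (unrelated to the frames above) of how keep='last' picks the row that survives:

    import pandas as pd

    d = pd.DataFrame({0: ['Date', 'Date', 'Name'], 1: [2014, 2019, 'a']})
    # Of the duplicated 'Date' values in column 0, only the last row is kept.
    d.drop_duplicates(0, keep='last')
    #       0     1
    # 1  Date  2019
    # 2  Name     a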
    
    .T.dropna()
    

    This transposes the dataframe back and drops the rows that still contain NaN values, giving the desired output.

  • 2021-01-20 02:34

    I also figured out that the code below does what I want and is much faster than the two for loops.

    df1.loc[[1,2,5],[1,3]] = df2.loc[[1,2,3],[1,2]].values
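    Note that the .values on the right-hand side matters: it hands pandas a plain NumPy array, so the assignment is positional instead of being aligned on df2's row/column labels. A minimal sketch of the difference, using two hypothetical frames:

    import pandas as pd

    a = pd.DataFrame({'x': [10.0, 20.0, 30.0]}, index=[0, 1, 2])
    b = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=[7, 8, 9])

    # DataFrame on the right: pandas aligns on b's labels (7, 8, 9), none of which
    # exist among a's target rows, so every assigned cell becomes NaN.
    a.loc[[0, 1, 2], ['x']] = b.loc[[7, 8, 9], ['x']]

    # NumPy array on the right: values are written positionally -> 1.0, 2.0, 3.0.
    a.loc[[0, 1, 2], ['x']] = b.loc[[7, 8, 9], ['x']].values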
    
  • 2021-01-20 02:39

    Some cleaning:

    def clean_df(df):
        # Promote the first row to column headers
        df.columns = df.iloc[0]
        df.columns.name = None
        # Drop the header row; reset_index keeps the old index as an 'index' column
        df = df.iloc[1:].reset_index()
        return df
    
    df1 = clean_df(df1)
    df1
       index Name Unit Attribute  Date
    0      1    a    A         1  2014
    1      2    b    B         2  2015
    2      3    c    C         3  2016
    3      4    d    D         4  2017
    4      5    e    E         5  2018
    
    df2 = clean_df(df2)
    df2    
       index Name Unit  Date
    0      1    a    F  2019
    1      2    b    G  2020
    2      3    e    H  2021
    3      4    f    I  2022
    

    Use merge, specifying on='Name', so the other columns are not considered when matching rows.

    cols = ['Name', 'Unit_y', 'Attribute', 'Date_y']
    df1 = df1.merge(df2, how='left', on='Name')[cols]\
                  .rename(columns=lambda x: x.split('_')[0]).fillna(df1)
    
    df1
      Name Unit Attribute  Date
    0    a    F         1  2019
    1    b    G         2  2020
    2    c    C         3  2016
    3    d    D         4  2017
    4    e    H         5  2021
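    The trailing .fillna(df1) is what restores the rows that had no match in df2 (c and d above): when fillna is given a DataFrame, each NaN is filled from the cell with the same row index and column label in that frame. A minimal sketch with hypothetical frames:

    import numpy as np
    import pandas as pd

    # 'merged' has gaps where the left join found no match; 'backup' plays df1's role.
    merged = pd.DataFrame({'Name': ['a', 'c'], 'Unit': ['F', np.nan]})
    backup = pd.DataFrame({'Name': ['a', 'c'], 'Unit': ['A', 'C']})

    merged.fillna(backup)
    #   Name Unit
    # 0    a    F
    # 1    c    C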
    