I wonder whether there is faster code to replace the two for loops, assuming the DataFrames are very large. In my real case, each DataFrame has 200 rows and 25 columns.
Another method uses merge and drop_duplicates on the transposed DataFrame, together with ffill:
new_df = (df1.merge(df2, on=[0], how='outer').T.set_index(0).sort_index()
          .ffill().reset_index().drop_duplicates(0, keep='last').T.dropna())
           0     2     3     5
0  Attribute  Date  Name  Unit
1          1  2019     a     F
2          2  2020     b     G
3          3  2016     c     C
4          4  2017     d     D
5          5  2021     e     H
Explanation
df1.merge(df2,on=[0],how='outer').T.set_index(0).sort_index()
Transposing the merged DataFrame puts each field in its own row, so ffill can fill the NaN values from the row above.
              1     2     3     4     5     6
0
Attribute     1     2     3     4     5   NaN
Date       2014  2015  2016  2017  2018   NaN
Date       2019  2020   NaN   NaN  2021  2022
Name          a     b     c     d     e     f
Unit          A     B     C     D     E   NaN
Unit          F     G   NaN   NaN     H     I
.ffill().reset_index().drop_duplicates(0,keep='last')
This fills the NaN values with the data from the previous row. reset_index turns the labels back into column 0, and drop_duplicates on column 0 with keep='last' keeps only the last row for each label, which is the completely filled one.
           0     1     2     3     4     5     6
0  Attribute     1     2     3     4     5   NaN
2       Date  2019  2020  2016  2017  2021  2022
3       Name     a     b     c     d     e     f
5       Unit     F     G     C     D     H     I
.T.dropna()
This transposes the DataFrame back and removes the rows that still contain NaN values, giving the desired output.
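The fill-then-deduplicate step can be tried in isolation. The sketch below is only an illustration: the frame `t` and its column names (`label`, `r1`, `r3`) are hypothetical stand-ins for the transposed, sorted frame, with the second source's row placed directly below the first source's row for each label.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the transposed, sorted frame: each label appears
# twice, first with the old values and then with the (partly missing) new ones.
t = pd.DataFrame(
    {
        "label": ["Date", "Date", "Unit", "Unit"],
        "r1": ["2014", "2019", "A", "F"],
        "r3": ["2016", np.nan, "C", np.nan],
    }
)

# ffill copies the value from the row above into each NaN cell.
filled = t.ffill()

# Keeping the last row per label keeps the new values where they exist
# and the forward-filled old values where they do not.
result = filled.drop_duplicates("label", keep="last")
print(result)
```

Note that ffill is not aware of the labels: it copies from whatever row sits directly above, so this relies on the rows for each label being adjacent and in old-then-new order.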
I also figured out that the code below does what I want and is much faster than the two for loops.
df1.loc[[1,2,5],[1,3]] = df2.loc[[1,2,3],[1,2]].values
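For anyone who wants to run that line, here is a minimal sketch. The input frames are an assumption: they are reconstructed from the outputs shown in this thread, with the header labels stored in row 0 and all values kept as strings.

```python
import pandas as pd

# Hypothetical input reconstructed from the outputs in this thread;
# row 0 holds the header labels, so the real data starts at row 1.
df1 = pd.DataFrame([
    ["Name", "Unit", "Attribute", "Date"],
    ["a", "A", "1", "2014"],
    ["b", "B", "2", "2015"],
    ["c", "C", "3", "2016"],
    ["d", "D", "4", "2017"],
    ["e", "E", "5", "2018"],
])
df2 = pd.DataFrame([
    ["Name", "Unit", "Date"],
    ["a", "F", "2019"],
    ["b", "G", "2020"],
    ["e", "H", "2021"],
    ["f", "I", "2022"],
])

# Rows 1, 2, 5 of df1 (names a, b, e) take Unit and Date from rows 1, 2, 3 of df2.
# .values drops df2's labels, so the assignment is purely positional.
df1.loc[[1, 2, 5], [1, 3]] = df2.loc[[1, 2, 3], [1, 2]].values
print(df1)
```

The row and column labels are hard-coded here, so this only works when the matching rows are already known.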
Some cleaning:
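def clean_df(df):
    # Promote the first row to column labels.
    df.columns = df.iloc[0]
    df.columns.name = None
    # Drop the header row; reset_index keeps the old labels in an 'index' column.
    df = df.iloc[1:].reset_index()
    return df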
df1 = clean_df(df1)
df1
   index Name Unit Attribute  Date
0      1    a    A         1  2014
1      2    b    B         2  2015
2      3    c    C         3  2016
3      4    d    D         4  2017
4      5    e    E         5  2018
df2 = clean_df(df2)
df2
   index Name Unit  Date
0      1    a    F  2019
1      2    b    G  2020
2      3    e    H  2021
3      4    f    I  2022
Use merge, specifying on='Name', so the other columns are not considered.
cols = ['Name', 'Unit_y', 'Attribute', 'Date_y']
df1 = df1.merge(df2, how='left', on='Name')[cols]\
.rename(columns=lambda x: x.split('_')[0]).fillna(df1)
df1
  Name Unit Attribute  Date
0    a    F         1  2019
1    b    G         2  2020
2    c    C         3  2016
3    d    D         4  2017
4    e    H         5  2021
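To see where the Unit_y and Date_y names in the column list come from, it can help to look at the intermediate merge result. The sketch below is an illustration under assumptions: the two frames are hypothetical cleaned versions of df1 and df2 (without the 'index' column, and with numeric Attribute and Date), and the per-column fillna is a more explicit variant of the same idea rather than the exact code above.

```python
import pandas as pd

# Hypothetical cleaned frames matching the tables above ('index' column omitted).
df1 = pd.DataFrame({
    "Name": list("abcde"),
    "Unit": list("ABCDE"),
    "Attribute": [1, 2, 3, 4, 5],
    "Date": [2014, 2015, 2016, 2017, 2018],
})
df2 = pd.DataFrame({
    "Name": ["a", "b", "e", "f"],
    "Unit": ["F", "G", "H", "I"],
    "Date": [2019, 2020, 2021, 2022],
})

# A left merge keeps every row of df1; columns present in both frames
# get the default suffixes _x (from df1) and _y (from df2).
merged = df1.merge(df2, how="left", on="Name")
print(merged.columns.tolist())

# Prefer df2's value where a match exists, otherwise fall back to df1's.
result = pd.DataFrame({
    "Name": merged["Name"],
    "Unit": merged["Unit_y"].fillna(merged["Unit_x"]),
    "Attribute": merged["Attribute"],
    "Date": merged["Date_y"].fillna(merged["Date_x"]).astype(int),
})
print(result)
```

Because the merge is a left join, the name f, which exists only in df2, does not appear in the result.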