Merge and update dataframes based on a subset of their columns


I wonder whether there is a faster way to replace the two for loops, given that the dataframes are large. In my real case, each dataframe is 200 rows and 25 columns.
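A quick way to check which approach is actually fastest on frames of this size is timeit; in the sketch below, loop_update and vectorized_update are placeholders for the original two-loop version and any candidate replacement, and the random 200×25 frames are stand-ins as well:

    import timeit

    import numpy as np
    import pandas as pd

    # Stand-in frames roughly matching the stated size (200 rows x 25 columns).
    df1 = pd.DataFrame(np.random.rand(200, 25))
    df2 = pd.DataFrame(np.random.rand(200, 25))

    def loop_update(a, b):
        ...  # placeholder for the original two-for-loop update

    def vectorized_update(a, b):
        ...  # placeholder for a candidate vectorized replacement

    print(timeit.timeit(lambda: loop_update(df1.copy(), df2.copy()), number=100))
    print(timeit.timeit(lambda: vectorized_update(df1.copy(), df2.copy()), number=100))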

3 Answers
  • 2021-01-20 02:15

    Another method: merge, then transpose, ffill, and drop duplicates on the transposed dataframe, i.e.

    new_df = (df1.merge(df2, on=[0], how='outer').T.set_index(0).sort_index()
              .ffill().reset_index().drop_duplicates(0, keep='last').T.dropna())
    
               0     2     3     5
    0  Attribute  Date  Name  Unit
    1          1  2019     a     F
    2          2  2020     b     G
    3          3  2016     c     C
    4          4  2017     d     D
    5          5  2021     e     H
    

    Explanation

    df1.merge(df2,on=[0],how='outer').T.set_index(0).sort_index()
    

    Transposing the merged dataframe gives a frame on which ffill can fill each NaN from the value above it in the same column (a small standalone ffill sketch follows the table below):

                1     2     3     4     5     6
    0                                            
    Attribute     1     2     3     4     5   NaN
    Date       2014  2015  2016  2017  2018   NaN
    Date       2019  2020   NaN   NaN  2021  2022
    Name          a     b     c     d     e     f
    Unit          A     B     C     D     E   NaN
    Unit          F     G   NaN   NaN     H     I
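    As a small standalone sketch (values copied from the two Date rows above), ffill propagates the last non-NaN value in each column downwards:

    import numpy as np
    import pandas as pd

    # The two 'Date' rows from the transposed frame, columns 1-3 only.
    t = pd.DataFrame({1: [2014.0, 2019.0], 2: [2015.0, 2020.0], 3: [2016.0, np.nan]},
                     index=['Date', 'Date'])
    t.ffill()
    #            1       2       3
    # Date  2014.0  2015.0  2016.0
    # Date  2019.0  2020.0  2016.0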
    
    .ffill().reset_index().drop_duplicates(0,keep='last')
    

    This fills the NaN values with the data from the previous row; then reset_index followed by drop_duplicates on column 0 with keep='last' keeps only the completely filled row for each attribute (a tiny standalone keep='last' sketch follows the table below):

             0     1     2     3     4     5     6
    0  Attribute     1     2     3     4     5   NaN
    2       Date  2019  2020  2016  2017  2021  2022
    3       Name     a     b     c     d     e     f
    5       Unit     F     G     C     D     H     I
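    A tiny standalone example (unrelated to the frames above) of how keep='last' picks the row that survives:

    import pandas as pd

    d = pd.DataFrame({0: ['Date', 'Date', 'Name'], 1: [2014, 2019, 'a']})
    # Of the duplicated 'Date' values in column 0, only the last row is kept.
    d.drop_duplicates(0, keep='last')
    #       0     1
    # 1  Date  2019
    # 2  Name     a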
    
    .T.dropna()
    

    This transposes the dataframe back and drops the rows that still contain NaN values, giving the desired output.

  • 2021-01-20 02:34

    I also figured out that the code below does what I want and is much faster than the two for loops.

    df1.loc[[1,2,5],[1,3]] = df2.loc[[1,2,3],[1,2]].values
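    Note that the .values on the right-hand side matters: it hands pandas a plain NumPy array, so the assignment is positional instead of being aligned on df2's row/column labels. A minimal sketch of the difference, using two hypothetical frames:

    import pandas as pd

    a = pd.DataFrame({'x': [10.0, 20.0, 30.0]}, index=[0, 1, 2])
    b = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=[7, 8, 9])

    # DataFrame on the right: pandas aligns on b's labels (7, 8, 9), none of which
    # exist among a's target rows, so every assigned cell becomes NaN.
    a.loc[[0, 1, 2], ['x']] = b.loc[[7, 8, 9], ['x']]

    # NumPy array on the right: values are written positionally -> 1.0, 2.0, 3.0.
    a.loc[[0, 1, 2], ['x']] = b.loc[[7, 8, 9], ['x']].values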
    
  • 2021-01-20 02:39

    Some cleaning:

    def clean_df(df):
        # Promote the first row to column headers
        df.columns = df.iloc[0]
        df.columns.name = None
        # Drop the header row; reset_index keeps the old index as an 'index' column
        df = df.iloc[1:].reset_index()
        return df
    
    df1 = clean_df(df1)
    df1
       index Name Unit Attribute  Date
    0      1    a    A         1  2014
    1      2    b    B         2  2015
    2      3    c    C         3  2016
    3      4    d    D         4  2017
    4      5    e    E         5  2018
    
    df2 = clean_df(df2)
    df2    
       index Name Unit  Date
    0      1    a    F  2019
    1      2    b    G  2020
    2      3    e    H  2021
    3      4    f    I  2022
    

    Use merge, specifying on='Name', so the other columns are not considered when matching rows.

    cols = ['Name', 'Unit_y', 'Attribute', 'Date_y']
    df1 = df1.merge(df2, how='left', on='Name')[cols]\
                  .rename(columns=lambda x: x.split('_')[0]).fillna(df1)
    
    df1
      Name Unit Attribute  Date
    0    a    F         1  2019
    1    b    G         2  2020
    2    c    C         3  2016
    3    d    D         4  2017
    4    e    H         5  2021
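    The trailing .fillna(df1) is what restores the rows that had no match in df2 (c and d above): when fillna is given a DataFrame, each NaN is filled from the cell with the same row index and column label in that frame. A minimal sketch with hypothetical frames:

    import numpy as np
    import pandas as pd

    # 'merged' has gaps where the left join found no match; 'backup' plays df1's role.
    merged = pd.DataFrame({'Name': ['a', 'c'], 'Unit': ['F', np.nan]})
    backup = pd.DataFrame({'Name': ['a', 'c'], 'Unit': ['A', 'C']})

    merged.fillna(backup)
    #   Name Unit
    # 0    a    F
    # 1    c    C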
    