Merging two pandas dataframes results in “duplicate” columns

后端 未结 3 1053
隐瞒了意图╮
隐瞒了意图╮ 2020-12-30 10:46

I\'m trying to merge two dataframes which contain the same key column. Some of the other columns also have identical headers, although not an equal number of rows, and after

相关标签:
3条回答
  • 2020-12-30 11:03

    I have the same problem with duplicate columns after left joins even when the columns' data is identical. I did a query and found out that NaN values are considered different even if both columns are NaN in pandas 0.14. BUT once you upgrade to 0.15, this problem disappears, which explains why it later works for you, you probably upgraded.

    0 讨论(0)
  • 2020-12-30 11:08

    The reason you have additional columns with suffixes '_x' and '_y' is because the columns you are merging do not have matching values so this clash produces additional columns. In that case you need to drop the additional '_y' columns and rename the '_x' columns:

    In [145]:
    # define our drop function
    def drop_y(df):
        # list comprehension of the cols that end with '_y'
        to_drop = [x for x in df if x.endswith('_y')]
        df.drop(to_drop, axis=1, inplace=True)
    
    drop_y(merged)
    merged
    Out[145]:
                   key  dept_name_x  res_name_x   year_x   need   holding  \
    0  DeptA_ResA_2015        DeptA        ResA     2015      1         1   
    1  DeptA_ResA_2016        DeptA        ResA     2016      1         1   
    2  DeptA_ResA_2017        DeptA        ResA     2017      1         1   
    
        no_of_inv   inv_cost_wo_ice  
    0           1           1000000  
    1           0                 0  
    2           0                 0  
    In [146]:
    # func to rename '_x' cols
    def rename_x(df):
        for col in df:
            if col.endswith('_x'):
                df.rename(columns={col:col.rstrip('_x')}, inplace=True)
    rename_x(merged)
    merged
    Out[146]:
                   key  dept_name  res_name   year   need   holding   no_of_inv  \
    0  DeptA_ResA_2015      DeptA      ResA   2015      1         1           1   
    1  DeptA_ResA_2016      DeptA      ResA   2016      1         1           0   
    2  DeptA_ResA_2017      DeptA      ResA   2017      1         1           0   
    
        inv_cost_wo_ice  
    0           1000000  
    1                 0  
    2                 0 
    

    EDIT If you added the common columns to your merge then it shouldn't produce the duplicated columns unless the matches on those columns do not match:

    merge_df = pd.merge(holding_df, invest_df, on=['key', 'dept_name', 'res_name', 'year'], how='left').fillna(0)
    
    0 讨论(0)
  • 2020-12-30 11:18

    Not exactly the answer, but pd.merge provides an argument to help you decide which suffixes should be added to your overlapping columns:

    merge_df = pd.merge(holding_df, invest_df, on='key', how='left', suffixes=('_holding', '_invest')).fillna(0)
    

    More meaningful names could be helpful if you decide to keep both (or to check why the columns are kept).

    See documentation for more reference.

    0 讨论(0)
提交回复
热议问题