Pandas Merge - How to avoid duplicating columns

前端 未结 5 1405
轮回少年
轮回少年 2020-11-28 02:23

I am attempting a merge between two data frames. Each data frame has two index levels (date, cusip). In the columns, some columns match between the two (currency, adj date

相关标签:
5条回答
  • 2020-11-28 02:43

    I use the suffixes option in .merge():

    dfNew = df.merge(df2, left_index=True, right_index=True,
                     how='outer', suffixes=('', '_y'))
    dfNew.drop(dfNew.filter(regex='_y$').columns.tolist(),axis=1, inplace=True)
    

    Thanks @ijoseph

    0 讨论(0)
  • 2020-11-28 02:54

    You can work out the columns that are only in one DataFrame and use this to select a subset of columns in the merge.

    cols_to_use = df2.columns.difference(df.columns)
    

    Then perform the merge (note this is an index object but it has a handy tolist() method).

    dfNew = merge(df, df2[cols_to_use], left_index=True, right_index=True, how='outer')
    

    This will avoid any columns clashing in the merge.

    0 讨论(0)
  • 2020-11-28 02:54

    Building on @rprog's answer, you can combine the various pieces of the suffix & filter step into one line using a negative regex:

    dfNew = df.merge(df2, left_index=True, right_index=True,
                 how='outer', suffixes=('', '_DROP')).filter(regex='^(?!.*_DROP)')
    

    Or using df.join:

    dfNew = df.join(df2, lsuffix="DROP").filter(regex="^(?!.*DROP)")
    

    The regex here is keeping anything that does not end with the word "DROP", so just make sure to use a suffix that doesn't appear among the columns already.

    0 讨论(0)
  • 2020-11-28 02:54

    This is a bit of going around the problem, but I have written a function that basically deals with the extra columns:

    def merge_fix_cols(df_company,df_product,uniqueID):
        
        df_merged = pd.merge(df_company,
                             df_product,
                             how='left',left_on=uniqueID,right_on=uniqueID)    
        for col in df_merged:
            if col.endswith('_x'):
                df_merged.rename(columns = lambda col:col.rstrip('_x'),inplace=True)
            elif col.endswith('_y'):
                to_drop = [col for col in df_merged if col.endswith('_y')]
                df_merged.drop(to_drop,axis=1,inplace=True)
            else:
                pass
        return df_merged
    

    Seems to work well with my merges!

    0 讨论(0)
  • 2020-11-28 02:55

    I'm freshly new with Pandas but I wanted to achieve the same thing, automatically avoiding column names with _x or _y and removing duplicate data. I finally did it by using this answer and this one from Stackoverflow

    sales.csv

        city;state;units
        Mendocino;CA;1
        Denver;CO;4
        Austin;TX;2
    

    revenue.csv

        branch_id;city;revenue;state_id
        10;Austin;100;TX
        20;Austin;83;TX
        30;Austin;4;TX
        47;Austin;200;TX
        20;Denver;83;CO
        30;Springfield;4;I
    

    merge.py import pandas

    def drop_y(df):
        # list comprehension of the cols that end with '_y'
        to_drop = [x for x in df if x.endswith('_y')]
        df.drop(to_drop, axis=1, inplace=True)
    
    
    sales = pandas.read_csv('data/sales.csv', delimiter=';')
    revenue = pandas.read_csv('data/revenue.csv', delimiter=';')
    
    result = pandas.merge(sales, revenue,  how='inner', left_on=['state'], right_on=['state_id'], suffixes=('', '_y'))
    drop_y(result)
    result.to_csv('results/output.csv', index=True, index_label='id', sep=';')
    

    When executing the merge command I replace the _x suffix with an empty string and them I can remove columns ending with _y

    output.csv

        id;city;state;units;branch_id;revenue;state_id
        0;Denver;CO;4;20;83;CO
        1;Austin;TX;2;10;100;TX
        2;Austin;TX;2;20;83;TX
        3;Austin;TX;2;30;4;TX
        4;Austin;TX;2;47;200;TX
    
    0 讨论(0)
提交回复
热议问题