Combine duplicated columns within a DataFrame

礼貌的吻别 2020-12-08 10:19

If I have a DataFrame in which several columns share the same name, is there a way to combine the columns with the same name using some function (e.g. sum)?

3 Answers
  • 2020-12-08 10:56

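    Here is a possibly simpler solution for common aggregation functions such as sum, mean, median, max, min and std: pass axis=1 to work across columns, together with the level parameter. Note that the level parameter of these reductions was deprecated in pandas 1.3 and removed in 2.0, so the calls below only run on older versions: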

    import numpy as np
    import pandas as pd
    
    # sample data from coldspeed's answer
    np.random.seed(0)
    df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
    print (df)
    
    print (df.sum(axis=1, level=0))
        A    B
    0  91    6
    1  48   76
    2  29   60
    3  39  108
    4  41   75
    
    df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
    
    print (df.sum(axis=1, level=1))
        A    B
    0  91    6
    1  48   76
    2  29   60
    3  39  108
    4  41   75
    
    print (df.sum(axis=1, level=[0,1]))
      one     two
        A   B   B
    0  91   0   6
    1  48  19  57
    2  29  24  36
    3  39  39  69
    4  41  37  38
    

    It works the same way for the index; use axis=0 instead of axis=1:

    np.random.seed(0)
    df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('ABCDE'), index=list('aabbc'))
    print (df)
        A   B   C   D   E
    a  44  47   0   3   3
    a  39   9  19  21  36
    b  23   6  24  24  12
    b   1  38  39  23  46
    c  24  17  37  25  13
    
    print (df.min(axis=0, level=0))
        A   B   C   D   E
    a  39   9   0   3   3
    b   1   6  24  23  12
    c  24  17  37  25  13
    
    df.index = pd.MultiIndex.from_arrays([['bar']*3 + ['foo']*2, df.index])
    
    print (df.mean(axis=0, level=1))
          A     B     C     D     E
    a  41.5  28.0   9.5  12.0  19.5
    b  12.0  22.0  31.5  23.5  29.0
    c  24.0  17.0  37.0  25.0  13.0
    
    print (df.max(axis=0, level=[0,1]))
            A   B   C   D   E
    bar a  44  47  19  21  36
        b  23   6  24  24  12
    foo b   1  38  39  23  46
        c  24  17  37  25  13
    

    If you need other functions such as first, last, size or count, you have to use coldspeed's answer.
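Where the reductions do not accept `level`, one possible stand-in is a plain `groupby(level=...)`, transposing first when the duplicates are in the column labels. This is only a sketch: the names `df_cols`, `df_rows`, `col_sum` and `row_min` are made up for illustration, and the values are typed in from the sample output shown in this answer rather than regenerated with `np.random`.

```python
import pandas as pd

# Values copied from the sample frame printed in this answer.
data = [[44, 47, 0, 3, 3],
        [39, 9, 19, 21, 36],
        [23, 6, 24, 24, 12],
        [1, 38, 39, 23, 46],
        [24, 17, 37, 25, 13]]

# Duplicated column labels: transpose, group the (now row) labels, transpose back.
df_cols = pd.DataFrame(data, columns=list("AABBB"))
col_sum = df_cols.T.groupby(level=0).sum().T    # stand-in for df.sum(axis=1, level=0)

# Duplicated index labels: no transpose needed.
df_rows = pd.DataFrame(data, columns=list("ABCDE"), index=list("aabbc"))
row_min = df_rows.groupby(level=0).min()        # stand-in for df.min(axis=0, level=0)

print(col_sum)
print(row_min)
```

Transposing is harmless here because every column is an integer; on a frame with mixed dtypes it upcasts to object, so this may not be a good fit there.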

  • 2020-12-08 11:02

    pandas >= 0.20: df.groupby(level=0, axis=1)

    You don't need a lambda here, nor do you explicitly have to query df.columns; groupby accepts a level argument you can specify in conjunction with the axis argument. This is cleaner, IMO.

    # Setup
    np.random.seed(0)
    df = pd.DataFrame(np.random.choice(50, (5, 5)), columns=list('AABBB'))
    df
    
        A   A   B   B   B
    0  44  47   0   3   3
    1  39   9  19  21  36
    2  23   6  24  24  12
    3   1  38  39  23  46
    4  24  17  37  25  13
    


    df.groupby(level=0, axis=1).sum()
    
        A    B
    0  91    6
    1  48   76
    2  29   60
    3  39  108
    4  41   75
    

    Handling MultiIndex columns

    Another case to consider is when dealing with MultiIndex columns. Consider

    df.columns = pd.MultiIndex.from_arrays([['one']*3 + ['two']*2, df.columns])
    df
      one         two    
        A   A   B   B   B
    0  44  47   0   3   3
    1  39   9  19  21  36
    2  23   6  24  24  12
    3   1  38  39  23  46
    4  24  17  37  25  13
    

    To perform aggregation across the upper levels, use

    df.groupby(level=1, axis=1).sum()
    
        A    B
    0  91    6
    1  48   76
    2  29   60
    3  39  108
    4  41   75
    

    or, to combine duplicates only within each upper-level group, use

    df.groupby(level=[0, 1], axis=1).sum()
    
      one     two
        A   B   B
    0  91   0   6
    1  48  19  57
    2  29  24  36
    3  39  39  69
    4  41  37  38
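pandas 2.1 and later warn that `axis=1` in `groupby` is deprecated. A rough sketch of the two MultiIndex aggregations above written with a transpose instead, assuming all columns share one dtype; the names `by_lower` and `by_pair` are hypothetical, and the data is typed in from the frame printed above:

```python
import pandas as pd

data = [[44, 47, 0, 3, 3],
        [39, 9, 19, 21, 36],
        [23, 6, 24, 24, 12],
        [1, 38, 39, 23, 46],
        [24, 17, 37, 25, 13]]

# Upper level: one, one, one, two, two. Lower level: A, A, B, B, B.
columns = pd.MultiIndex.from_arrays([["one"] * 3 + ["two"] * 2, list("AABBB")])
df = pd.DataFrame(data, columns=columns)

# Group on the lower level only, summing across the upper level.
by_lower = df.T.groupby(level=1).sum().T

# Group on the (upper, lower) pair, so only exact duplicates are combined.
by_pair = df.T.groupby(level=[0, 1]).sum().T

print(by_lower)
print(by_pair)
```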
    

    Alternate Interpretation: Dropping Duplicate Columns

    If you came here looking to find out how to simply drop duplicate columns (without performing any aggregation), use Index.duplicated (shown here on the original df with single-level columns):

    df.loc[:,~df.columns.duplicated()]
    
        A   B
    0  44   0
    1  39  19
    2  23  24
    3   1  39
    4  24  37
    

    Or, to keep the last ones, specify keep='last' (default is 'first'),

    df.loc[:,~df.columns.duplicated(keep='last')]
    
        A   B
    0  47   3
    1   9  36
    2   6  12
    3  38  46
    4  17  13
    

    The groupby alternatives for the two solutions above are df.groupby(level=0, axis=1).first(), and ... .last(), respectively.
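To check that the `duplicated` and groupby routes agree, here is a small sketch on the sample data. It uses a transpose rather than `axis=1`, which is my substitution and not part of the answer, and the variable names are made up for illustration:

```python
import pandas as pd

data = [[44, 47, 0, 3, 3],
        [39, 9, 19, 21, 36],
        [23, 6, 24, 24, 12],
        [1, 38, 39, 23, 46],
        [24, 17, 37, 25, 13]]
df = pd.DataFrame(data, columns=list("AABBB"))

# Boolean-mask route: keep the first or the last column of each duplicated label.
keep_first = df.loc[:, ~df.columns.duplicated()]
keep_last = df.loc[:, ~df.columns.duplicated(keep="last")]

# groupby route: first / last value per label, computed on the transposed frame.
grouped_first = df.T.groupby(level=0).first().T
grouped_last = df.T.groupby(level=0).last().T

print(keep_first.values.tolist() == grouped_first.values.tolist())
print(keep_last.values.tolist() == grouped_last.values.tolist())
```

The two routes match here because the data has no missing values. `first()` and `last()` return the first or last non-null entry per group, so with NaNs present the results can differ from the mask-based version.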

  • 2020-12-08 11:16

    I believe this does what you are after:

    df.groupby(lambda x:x, axis=1).sum()
    

    Alternatively, between 3% and 15% faster depending on the length of the df:

    df.groupby(df.columns, axis=1).sum()
    

    EDIT: To extend this beyond sums, use .agg() (short for .aggregate()):

    df.groupby(df.columns, axis=1).agg(np.max)
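As an illustration of `.agg()` with something other than a sum, here is a sketch using a named aggregation and a custom function. It groups the transposed frame instead of passing `axis=1` (an assumption on my part, made to avoid the deprecation warning in newer pandas); `col_max` and `col_range` are hypothetical names, and the data is typed in from the sample frame used in the other answers.

```python
import pandas as pd

data = [[44, 47, 0, 3, 3],
        [39, 9, 19, 21, 36],
        [23, 6, 24, 24, 12],
        [1, 38, 39, 23, 46],
        [24, 17, 37, 25, 13]]
df = pd.DataFrame(data, columns=list("AABBB"))

grouped = df.T.groupby(level=0)

# Built-in aggregation selected by name.
col_max = grouped.agg("max").T

# Custom aggregation: spread (max - min) among the columns sharing a label.
col_range = grouped.agg(lambda s: s.max() - s.min()).T

print(col_max)
print(col_range)
```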
    