Pandas percentage of total with groupby

没有蜡笔的小新 2020-11-22 06:41

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains three columns: the State, the Office ID, and the Sales for that office.

14 Answers
  • 2020-11-22 07:20

    For conciseness I'd use the SeriesGroupBy:

    In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
    
    In [12]: c
    Out[12]:
    state  office_id
    AZ     2            925105
           4            592852
           6            362198
    CA     1            819164
           3            743055
           5            292885
    CO     1            525994
           3            338378
           5            490335
    WA     2            623380
           4            441560
           6            451428
    Name: count, dtype: int64
    
    In [13]: c / c.groupby(level=0).sum()
    Out[13]:
    state  office_id
    AZ     2            0.492037
           4            0.315321
           6            0.192643
    CA     1            0.441573
           3            0.400546
           5            0.157881
    CO     1            0.388271
           3            0.249779
           5            0.361949
    WA     2            0.411101
           4            0.291196
           6            0.297703
    Name: count, dtype: float64
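
The division above works because pandas aligns the per-state totals with the `state` level of the two-level index. A minimal self-contained sketch of the same idea follows; the sample values are made up for illustration and are not the data from the question's CSV.

```python
import pandas as pd

# Hypothetical sample data; the real values would come from the CSV.
df = pd.DataFrame({
    "state": ["AZ", "AZ", "CA", "CA"],
    "office_id": [2, 4, 1, 3],
    "sales": [30, 70, 25, 75],
})

# Total sales per (state, office_id) pair: a Series with a two-level index.
c = df.groupby(["state", "office_id"])["sales"].sum()

# Total sales per state: a Series indexed by state only.
state_totals = c.groupby(level=0).sum()

# Each office's share of its state's total, as a fraction between 0 and 1.
share = c / state_totals
print(share)
```

Within each state the resulting fractions should add up to 1, which is a quick sanity check on real data.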
    

    For multiple groups you have to use transform (using Radical's df):

    In [21]: c = df.groupby(["Group 1", "Group 2", "Final Group"])["Numbers I want as percents"].sum().rename("count")
    
    In [22]: c / c.groupby(level=[0, 1]).transform("sum")
    Out[22]:
    Group 1  Group 2  Final Group
    AAHQ     BOSC     OWON           0.331006
                      TLAM           0.668994
             MQVF     BWSI           0.288961
                      FXZM           0.711039
             ODWV     NFCH           0.262395
    ...
    Name: count, dtype: float64
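
A small sketch of the multi-level case, using hypothetical column names (`g1`, `g2`, `g3`, `value`) and made-up numbers rather than Radical's DataFrame. `transform("sum")` returns a Series with the same three-level index as `c`, so the division lines up row by row.

```python
import pandas as pd

# Hypothetical three-level grouping data, for illustration only.
df = pd.DataFrame({
    "g1": ["A", "A", "A", "A", "B", "B"],
    "g2": ["x", "x", "y", "y", "x", "x"],
    "g3": ["p", "q", "p", "q", "p", "q"],
    "value": [1, 3, 2, 2, 4, 6],
})

# Sum per (g1, g2, g3) combination.
c = df.groupby(["g1", "g2", "g3"])["value"].sum()

# Sum per (g1, g2) parent group, broadcast back onto every row of c.
parent_totals = c.groupby(level=[0, 1]).transform("sum")

# Fraction of each final group within its (g1, g2) parent.
pct = c / parent_totals
print(pct)
```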
    

    In my timings this is somewhat faster than the other answers: just under twice the speed of Radical's answer, at about 0.08 s for me.

  • 2020-11-22 07:21

    (This solution is inspired by this article: https://pbpython.com/pandas_transform.html)

    I find the following solution to be the simplest (and probably the fastest), using transformation:

    Transformation: While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input.
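
The shape difference can be seen in a short sketch. The three-row DataFrame is made up for illustration; only the `state` and `sales` column names follow the question.

```python
import pandas as pd

# Hypothetical data: two AZ offices and one CA office.
df = pd.DataFrame({
    "state": ["AZ", "AZ", "CA"],
    "sales": [10, 30, 5],
})

grouped = df.groupby("state")["sales"]

# Aggregation: one value per group, indexed by state.
aggregated = grouped.sum()

# Transformation: one value per original row, indexed like df.
transformed = grouped.transform("sum")

print(aggregated.shape)   # (2,)
print(transformed.shape)  # (3,)
```

Because `transformed` shares `df`'s index, it can be assigned to a new column or used in element-wise arithmetic with other columns directly.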

    So, using transformation, the solution is a one-liner:

    df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')
    

    Printing the result, sorted by state and office, gives:

    print(df.sort_values(['state', 'office_id']).reset_index(drop=True))
    
       state  office_id   sales          %
    0     AZ          2  195197   9.844309
    1     AZ          4  877890  44.274352
    2     AZ          6  909754  45.881339
    3     CA          1  614752  50.415708
    4     CA          3  395340  32.421767
    5     CA          5  209274  17.162525
    6     CO          1  549430  42.659629
    7     CO          3  457514  35.522956
    8     CO          5  280995  21.817415
    9     WA          2  828238  35.696929
    10    WA          4  719366  31.004563
    11    WA          6  772590  33.298509
    