Pandas percentage of total with groupby

没有蜡笔的小新 2020-11-22 06:41

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns: the State, the Office ID, and the Sales for that office.

14 Answers
  • 2020-11-22 07:05

    I think this needs benchmarking. Using OP's original DataFrame,

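    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
        'office_id': list(range(1, 7)) * 2,  # in Python 3, range() must be a list before repeating it
        'sales': [np.random.randint(100000, 999999) for _ in range(12)]
    })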
    

    1st Andy Hayden

    As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.

    c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
    c / c.groupby(level=0).sum()
    

    3.42 ms ± 16.7 µs per loop
    (mean ± std. dev. of 7 runs, 100 loops each)
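
    To make the alignment explicit, here is a minimal sketch of the same idea using `Series.div` with `level="state"`, followed by a sanity check that the shares within each state add up to 1. The seeded generator and the names `office_sales`, `state_sales`, `share` and `per_state_sum` are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
df = pd.DataFrame({
    "state": ["CA", "WA", "CO", "AZ"] * 3,
    "office_id": list(range(1, 7)) * 2,
    "sales": rng.integers(100_000, 999_999, size=12),
})

# Per-office totals, indexed by (state, office_id).
office_sales = df.groupby(["state", "office_id"])["sales"].sum()

# Per-state totals, indexed by state only.
state_sales = office_sales.groupby(level="state").sum()

# Divide each office total by its state's total, matching on the "state" level.
share = office_sales.div(state_sales, level="state")

# Sanity check: the shares within each state should add up to 1.
per_state_sum = share.groupby(level="state").sum()
print(share)
print(per_state_sum)
```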


    2nd Paul H

    state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
    state = df.groupby(['state']).agg({'sales': 'sum'})
    state_office.div(state, level='state') * 100
    

    4.66 ms ± 24.4 µs per loop
    (mean ± std. dev. of 7 runs, 100 loops each)


    3rd exp1orer

    This is the slowest answer as it calculates x.sum() for each x in level 0.

    For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply allows you to use method chaining to write this in a single line. We therefore remove the need to decide on a variable's name, which is actually very computationally expensive for your most valuable resource (your brain!!).

    Here is the modification,

    (
        df.groupby(['state', 'office_id'])
        .agg({'sales': 'sum'})
        .groupby(level=0)
        .apply(lambda x: 100 * x / float(x.sum()))
    )
    

    10.6 ms ± 81.5 µs per loop
    (mean ± std. dev. of 7 runs, 100 loops each)
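
    If the attraction is the single chained expression, the chain can probably be kept without `apply`. The sketch below is one possible variant, assuming `Series.pipe` together with `groupby(...).transform("sum")`; the name `pct_of_state` and the seeded data are hypothetical, and this version was not part of the timings above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
df = pd.DataFrame({
    "state": ["CA", "WA", "CO", "AZ"] * 3,
    "office_id": list(range(1, 7)) * 2,
    "sales": rng.integers(100_000, 999_999, size=12),
})

pct = (
    df.groupby(["state", "office_id"])["sales"]
    .sum()
    # transform("sum") broadcasts each state's total back onto its offices,
    # so the division is element-wise and needs no per-group Python call.
    .pipe(lambda s: 100 * s / s.groupby(level="state").transform("sum"))
    .rename("pct_of_state")
)
print(pct)
```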


    No one is going to care about 6 ms on a small dataset. However, this is a 3x speed-up, and on a larger dataset with high-cardinality groupbys it is going to make a massive difference.

    Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,

    import string
    
    import numpy as np
    import pandas as pd
    np.random.seed(0)
    
    # The 26 lowercase letters to sample from.
    letters = np.array(list(string.ascii_lowercase))
    
    # 30,000 random three-letter group labels.
    groups = [
        ''.join(chars) for chars in zip(
            np.random.choice(letters, 30000),
            np.random.choice(letters, 30000),
            np.random.choice(letters, 30000),
        )
    ]
    
    df = pd.DataFrame({
        'state': groups * 400,
        'office_id': list(range(1, 601)) * 20000,
        'sales': [np.random.randint(100000, 999999) for _ in range(12)] * 1000000,
    })
    

    Using Andy's,

    2 s ± 10.4 ms per loop
    (mean ± std. dev. of 7 runs, 1 loop each)

    and exp1orer

    19 s ± 77.1 ms per loop
    (mean ± std. dev. of 7 runs, 1 loop each)

    So now we see a roughly 10x speed-up on large, high-cardinality datasets.
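
    For anyone who wants to repeat the comparison outside IPython's `%timeit`, a small harness along these lines may help. It is only a sketch: the function names `vectorised` and `with_apply`, the seeded 12-row frame and the loop count of 20 are assumptions, and the numbers it prints will differ from the timings quoted above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
df = pd.DataFrame({
    "state": ["CA", "WA", "CO", "AZ"] * 3,
    "office_id": list(range(1, 7)) * 2,
    "sales": rng.integers(100_000, 999_999, size=12),
})


def vectorised(frame):
    """Divide office totals by state totals using index alignment."""
    c = frame.groupby(["state", "office_id"])["sales"].sum()
    return 100 * c.div(c.groupby(level="state").sum(), level="state")


def with_apply(frame):
    """Compute the same percentages with one Python call per state."""
    c = frame.groupby(["state", "office_id"])["sales"].sum()
    grouped = c.groupby(level="state", group_keys=False)
    return grouped.apply(lambda x: 100 * x / x.sum())


# Average seconds per call over a small, fixed number of loops.
loops = 20
timings = {
    fn.__name__: timeit.timeit(lambda fn=fn: fn(df), number=loops) / loops
    for fn in (vectorised, with_apply)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per loop")
```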


    Be sure to upvote these three answers if you upvote this one!!

  • 2020-11-22 07:05

    I think this would do the trick in one line (it gives each office's share of the grand total):

    df.groupby(['state', 'office_id']).sum().transform(lambda x: x/np.sum(x)*100)
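
    The one-liner divides by the sum over all rows. If the share within each state is wanted instead, a per-state variant might look like the sketch below, which assumes a second `groupby` on the `state` index level; the names `per_office`, `pct_of_grand_total` and `pct_of_state` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
df = pd.DataFrame({
    "state": ["CA", "WA", "CO", "AZ"] * 3,
    "office_id": list(range(1, 7)) * 2,
    "sales": rng.integers(100_000, 999_999, size=12),
})

per_office = df.groupby(["state", "office_id"])["sales"].sum()

# Share of the grand total: all twelve values together add up to 100.
pct_of_grand_total = per_office / per_office.sum() * 100

# Share within each state: the values of every state add up to 100.
pct_of_state = per_office.groupby(level="state").transform(
    lambda x: x / x.sum() * 100
)
print(pct_of_state)
```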
    
  • 2020-11-22 07:08

    As someone who is also learning pandas, I found the other answers a bit implicit, because pandas hides most of the work behind the scenes, namely by automatically matching up column and index names when it performs the operation. This code should be equivalent to a step-by-step version of @exp1orer's accepted answer.

    Start with the grouped DataFrame, which I'll call state_office_sales:

                      sales
    state office_id        
    AZ    2          839507
          4          373917
          6          347225
    CA    1          798585
          3          890850
          5          454423
    CO    1          819975
          3          202969
          5          614011
    WA    2          163942
          4          369858
          6          959285
    

    state_total_sales is state_office_sales summed within each value of index level 0 (the leftmost level, state).

    In:   state_total_sales = state_office_sales.groupby(level=0).sum()
          state_total_sales
    
    Out: 
           sales
    state   
    AZ     2448009
    CA     2832270
    CO     1495486
    WA     595859
    

    Because the two DataFrames share an index name (state) and a column name (sales), pandas aligns them on those shared labels:

    In:   state_office_sales / state_total_sales
    
    Out:  
    
                       sales
    state   office_id   
    AZ      2          0.448640
            4          0.125865
            6          0.425496
    CA      1          0.288022
            3          0.322169
            5          0.389809
    CO      1          0.206684
            3          0.357891
            5          0.435425
    WA      2          0.321689
            4          0.346325
            6          0.331986
    

    To illustrate this further, here is a partial total that includes a state XX with no counterpart in state_office_sales. Pandas matches rows by index and column names: states missing from partial_total (CA and CO) come out as NaN, and the unmatched XX row does not appear in the result:

    In:   partial_total = pd.DataFrame(
                          data   =  {'sales' : [2448009, 595859, 99999]},
                          index  =             ['AZ',    'WA',   'XX' ]
                          )
          partial_total.index.name = 'state'
    
    
    Out:  
             sales
    state
    AZ       2448009
    WA       595859
    XX       99999
    
    In:   state_office_sales / partial_total
    
    Out: 
                       sales
    state   office_id   
    AZ      2          0.448640
            4          0.125865
            6          0.425496
    CA      1          NaN
            3          NaN
            5          NaN
    CO      1          NaN
            3          NaN
            5          NaN
    WA      2          0.321689
            4          0.346325
            6          0.331986
    

    This becomes very clear when there are no shared indexes or columns. Here missing_index_totals is equal to state_total_sales except that it has no index name.

    In:   missing_index_totals = state_total_sales.rename_axis("")
          missing_index_totals
    
    Out:  
           sales
    AZ     2448009
    CA     2832270
    CO     1495486
    WA     595859
    
    In:   state_office_sales / missing_index_totals 
    
    Out:  ValueError: cannot join with no overlapping index names
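
    Two ways around that error are sketched below: put the index name back with `rename_axis`, or name the level to match on explicitly with `DataFrame.div(..., level=...)`. The small frame reuses the AZ and WA rows shown earlier; `unnamed_totals`, `restored`, `ratio_a` and `ratio_b` are names made up for this illustration, and `rename_axis(None)` is used here as an assumed stand-in for the unnamed index.

```python
import numpy as np
import pandas as pd

state_office_sales = pd.DataFrame(
    {"sales": [839507, 373917, 347225, 163942, 369858, 959285]},
    index=pd.MultiIndex.from_tuples(
        [("AZ", 2), ("AZ", 4), ("AZ", 6), ("WA", 2), ("WA", 4), ("WA", 6)],
        names=["state", "office_id"],
    ),
)
state_total_sales = state_office_sales.groupby(level="state").sum()

# Totals whose index has lost its name.
unnamed_totals = state_total_sales.rename_axis(None)

# Option 1: restore the name so that plain division can align on "state".
restored = unnamed_totals.rename_axis("state")
ratio_a = state_office_sales / restored

# Option 2: leave the totals unnamed and say which level to match on.
ratio_b = state_office_sales.div(unnamed_totals, level="state")

print(ratio_a)
```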
    
  • 2020-11-22 07:11

    A simple way I have used is to merge the results of the two groupbys and then do a simple division.

    import numpy as np
    import pandas as pd
    np.random.seed(0)
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
    
    state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
    state = df.groupby(['state'])['sales'].sum().reset_index()
    state_office = state_office.merge(state, left_on='state', right_on ='state', how = 'left')
    state_office['sales_ratio'] = 100*(state_office['sales_x']/state_office['sales_y'])
    
       state  office_id  sales_x  sales_y  sales_ratio
    0     AZ          2   222579  1310725    16.981365
    1     AZ          4   252315  1310725    19.250033
    2     AZ          6   835831  1310725    63.768601
    3     CA          1   405711  2098663    19.331879
    4     CA          3   710581  2098663    33.858747
    5     CA          5   982371  2098663    46.809373
    6     CO          1   404137  1096653    36.851857
    7     CO          3   217952  1096653    19.874290
    8     CO          5   474564  1096653    43.273852
    9     WA          2   535829  1543854    34.707233
    10    WA          4   548242  1543854    35.511259
    11    WA          6   459783  1543854    29.781508
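
    The `sales_x` / `sales_y` suffixes come from both frames having a `sales` column. A variant that renames the state total before merging keeps the column names readable; this is a sketch under that assumption, with `state_sales` as a hypothetical column name and `validate="many_to_one"` added as an optional guard against duplicate states on the right-hand side.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
df = pd.DataFrame({
    "state": ["CA", "WA", "CO", "AZ"] * 3,
    "office_id": list(range(1, 7)) * 2,
    "sales": rng.integers(100_000, 999_999, size=12),
})

# Office-level totals as ordinary columns.
office = df.groupby(["state", "office_id"], as_index=False)["sales"].sum()

# State-level totals, renamed up front so the merge needs no suffixes.
state = (
    df.groupby("state", as_index=False)["sales"]
    .sum()
    .rename(columns={"sales": "state_sales"})
)

# Each office row picks up exactly one state total.
merged = office.merge(state, on="state", how="left", validate="many_to_one")
merged["sales_ratio"] = 100 * merged["sales"] / merged["state_sales"]
print(merged)
```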
    
  • 2020-11-22 07:14

    I realize there are already good answers here.

    I nevertheless would like to contribute my own, because I feel for an elementary, simple question like this, there should be a short solution that is understandable at a glance.

    It should also work in a way that I can add the percentages as a new column, leaving the rest of the dataframe untouched. Last but not least, it should generalize in an obvious way to the case in which there is more than one grouping level (e.g., state and country instead of only state).

    The following snippet fulfills these criteria:

    df['sales_ratio'] = df.groupby(['state'])['sales'].transform(lambda x: x/x.sum())
    

    Note that if you're still using Python 2, you'll have to replace x.sum() in the denominator of the lambda with float(x.sum()).
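
    To show how this extends to more than one grouping level, here is a minimal sketch with a made-up `country` column and made-up sales figures (neither is in the original data). It uses `transform("sum")` rather than a lambda, which should give the same ratios.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "country" column and all numbers are invented.
df = pd.DataFrame({
    "country": ["US"] * 4 + ["CA"] * 4,
    "state": ["CA", "CA", "WA", "WA", "ON", "ON", "BC", "BC"],
    "office_id": [1, 2, 1, 2, 1, 2, 1, 2],
    "sales": [100, 300, 50, 150, 20, 80, 10, 30],
})

# Total sales of each (country, state) pair, broadcast back to every row.
group_total = df.groupby(["country", "state"])["sales"].transform("sum")

# New column; the rest of the DataFrame is left untouched.
df["sales_ratio"] = df["sales"] / group_total
print(df)
```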

  • 2020-11-22 07:19

    The most elegant way to find percentages across columns or index is to use pd.crosstab.

    Sample Data

    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
    

    The sample DataFrame looks like this:

    print(df)
    
       state  office_id   sales
    0     CA          1  764505
    1     WA          2  313980
    2     CO          3  558645
    3     AZ          4  883433
    4     CA          5  301244
    5     WA          6  752009
    6     CO          1  457208
    7     AZ          2  259657
    8     CA          3  584471
    9     WA          4  122358
    10    CO          5  721845
    11    AZ          6  136928
    

    Just specify the index, columns and the values to aggregate. The normalize keyword divides by the row or column totals depending on its value, returning fractions between 0 and 1, so the '%' format below multiplies by 100 when rendering.

    # Note: on pandas >= 2.1, DataFrame.map is the preferred name for applymap.
    result = pd.crosstab(index=df['state'], 
                         columns=df['office_id'], 
                         values=df['sales'], 
                         aggfunc='sum', 
                         normalize='index').applymap('{:.0%}'.format)
    
    print(result)
    office_id    1    2    3    4    5    6
    state
    AZ          0%  20%   0%  69%   0%  11%
    CA         46%   0%  35%   0%  18%   0%
    CO         26%   0%  32%   0%  42%   0%
    WA          0%  26%   0%  10%   0%  63%
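
    A quick way to see what `normalize` does is to leave the result unformatted and check the totals. The sketch below reuses the sample values printed above; `shares`, `row_sums` and `by_office` are hypothetical names. With `normalize='index'` each row (state) should sum to 1, and with `normalize='columns'` each column (office_id) should.

```python
import numpy as np
import pandas as pd

# The sample values printed in this answer, hard-coded for repeatability.
df = pd.DataFrame({
    "state": ["CA", "WA", "CO", "AZ"] * 3,
    "office_id": list(range(1, 7)) * 2,
    "sales": [764505, 313980, 558645, 883433, 301244, 752009,
              457208, 259657, 584471, 122358, 721845, 136928],
})

# Fractions of each state's total (rows sum to 1).
shares = pd.crosstab(
    index=df["state"],
    columns=df["office_id"],
    values=df["sales"],
    aggfunc="sum",
    normalize="index",
)
row_sums = shares.sum(axis=1)

# Fractions of each office_id's total (columns sum to 1).
by_office = pd.crosstab(
    index=df["state"],
    columns=df["office_id"],
    values=df["sales"],
    aggfunc="sum",
    normalize="columns",
)

print(shares.round(4))
print(row_sums)
```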
    