Pandas percentage of total with groupby

没有蜡笔的小新 2020-11-22 06:41

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns: the State, the Office ID, and the Sales for that office.

14 Answers
  • 2020-11-22 06:55

    Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way -- just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer:

    # From Paul H
    import numpy as np
    import pandas as pd
    np.random.seed(0)
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999)
                                 for _ in range(12)]})
    state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
    # Change: groupby state_office and divide by sum
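    # group_keys=False keeps the original (state, office_id) index on newer pandas;
    # x.sum() replaces float(x.sum()), which is deprecated for a one-element Series
    state_pcts = state_office.groupby(level=0, group_keys=False).apply(
        lambda x: 100 * x / x.sum())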
    

    Returns:

                         sales
    state office_id           
    AZ    2          16.981365
          4          19.250033
          6          63.768601
    CA    1          19.331879
          3          33.858747
          5          46.809373
    CO    1          36.851857
          3          19.874290
          5          43.273852
    WA    2          34.707233
          4          35.511259
          6          29.781508
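
    The same division can also be written without a lambda. The following is a minimal sketch, assuming a pandas version where `transform('sum')` is available on grouped DataFrames; the variable name `state_totals` is introduced here for illustration and is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same setup as above (copied from Paul H's answer).
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

# transform('sum') returns each state's total repeated on every row of the
# original (state, office_id) index, so the division lines up row by row.
state_totals = state_office.groupby(level='state').transform('sum')
state_pcts = 100 * state_office / state_totals
print(state_pcts)
```

    Because the totals already share the index of state_office, the result keeps the (state, office_id) index unchanged.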
    
  • 2020-11-22 06:55

    I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number of unique groups (probably because of the lambda). I built on their answer and turned it into an array calculation, which is much faster. Example code below:

    Create the test dataframe with 50,000 unique groups

    import random
    import string
    import pandas as pd
    import numpy as np
    np.random.seed(0)
    
    # This is the total number of groups to be created
    NumberOfGroups = 50000
    
    # Create a lot of groups (random strings of 4 letters)
    # Integer division (//) so that range() receives an int on Python 3
    Group1     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//10)]*10
    Group2     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//2)]*2
    FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]
    
    # Make the numbers
    NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]
    
    # Make the dataframe
    df = pd.DataFrame({'Group 1': Group1,
                       'Group 2': Group2,
                       'Final Group': FinalGroup,
                       'Numbers I want as percents': NumbersForPercents})
    

    When grouped it looks like:

                                 Numbers I want as percents
    Group 1 Group 2 Final Group                            
    AAAH    AQYR    RMCH                                847
                    XDCL                                182
            DQGO    ALVF                                132
                    AVPH                                894
            OVGH    NVOO                                650
                    VKQP                                857
            VNLY    HYFW                                884
                    MOYH                                469
            XOOC    GIDS                                168
                    HTOY                                544
    AACE    HNXU    RAXK                                243
                    YZNK                                750
            NOYI    NYGC                                399
                    ZYCI                                614
            QKGK    CRLF                                520
                    UXNA                                970
            TXAR    MLNB                                356
                    NMFJ                                904
            VQYG    NPON                                504
                    QPKQ                                948
    ...
    [50000 rows x 1 columns]
    

    Array method of finding percentage:

    # Initial grouping (basically a sorted version of df)
    PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
    # Get the sum of values for the "final group", append "_Sum" to its column name, and change it into a dataframe (.reset_index)
    SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
    # Merge the two dataframes
    Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
    # Divide the two columns
    Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
    # Drop the extra _Sum column
    Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)
    

    This method takes about 0.15 seconds.

    Top answer method (using lambda function):

    state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
    state_pcts = state_office.groupby(level=['Group 1','Group 2'], group_keys=False).apply(lambda x: 100 * x / x.sum())
    

    This method takes about 21 seconds to produce the same result.

    The result:

          Group 1 Group 2 Final Group  Numbers I want as percents  Percent of Final Group
    0        AAAH    AQYR        RMCH                         847               82.312925
    1        AAAH    AQYR        XDCL                         182               17.687075
    2        AAAH    DQGO        ALVF                         132               12.865497
    3        AAAH    DQGO        AVPH                         894               87.134503
    4        AAAH    OVGH        NVOO                         650               43.132050
    5        AAAH    OVGH        VKQP                         857               56.867950
    6        AAAH    VNLY        HYFW                         884               65.336290
    7        AAAH    VNLY        MOYH                         469               34.663710
    8        AAAH    XOOC        GIDS                         168               23.595506
    9        AAAH    XOOC        HTOY                         544               76.404494
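
    As a cross-check, the merge-based percentages can be compared with a `groupby(...).transform('sum')` version, which is also vectorized. This is only a sketch on a small hypothetical frame (the five rows and the names `value`, `pre`, `sums`, `merged`, `totals` and `pct_transform` are made up for illustration); no timing is claimed for it.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the 50,000-row one above.
df = pd.DataFrame({
    'Group 1': ['A', 'A', 'A', 'B', 'B'],
    'Group 2': ['x', 'x', 'y', 'x', 'x'],
    'Final Group': ['p', 'q', 'r', 's', 't'],
    'Numbers I want as percents': [30, 70, 50, 10, 40],
})
value = 'Numbers I want as percents'

# Merge-based method from the answer.
pre = df.groupby(['Group 1', 'Group 2', 'Final Group']).agg({value: 'sum'}).reset_index()
sums = df.groupby(['Group 1', 'Group 2']).agg({value: 'sum'}).add_suffix('_Sum').reset_index()
merged = pd.merge(pre, sums)
merged['Percent of Final Group'] = merged[value] / merged[value + '_Sum'] * 100

# transform-based alternative: group totals aligned to the rows of `pre`.
totals = pre.groupby(['Group 1', 'Group 2'])[value].transform('sum')
pct_transform = 100 * pre[value] / totals

print(merged)
```

    Both routes divide each row by the total of its (Group 1, Group 2) pair, so the two percentage columns should agree.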
    
  • 2020-11-22 06:56

    One-line solution:

    df.join(
        df.groupby('state').agg(state_total=('sales', 'sum')),
        on='state'
    ).eval('sales / state_total')
    

    This returns a Series of per-office ratios. It can be used on its own or assigned to the original DataFrame.
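
    A minimal sketch of assigning the result back, assuming a pandas version with named aggregation (the `agg(state_total=(...))` form) and using hypothetical sample values; the column name `sales_ratio` is chosen here for illustration.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({'state': ['CA', 'CA', 'WA', 'WA'],
                   'office_id': [1, 2, 1, 2],
                   'sales': [100, 300, 50, 150]})

# Join each row to its state total, then evaluate the ratio row by row.
ratio = df.join(
    df.groupby('state').agg(state_total=('sales', 'sum')),
    on='state'
).eval('sales / state_total')

# The Series shares df's index, so it can be assigned back directly.
df['sales_ratio'] = ratio
print(df)
```

    Multiply by 100 if percentages rather than ratios are wanted.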

  • 2020-11-22 06:57

    You need to make a second groupby object that groups by the states, and then use the div method:

    import numpy as np
    import pandas as pd
    np.random.seed(0)
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
    
    state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
    state = df.groupby(['state']).agg({'sales': 'sum'})
    state_office.div(state, level='state') * 100
    
    
                         sales
    state office_id           
    AZ    2          16.981365
          4          19.250033
          6          63.768601
    CA    1          19.331879
          3          33.858747
          5          46.809373
    CO    1          36.851857
          3          19.874290
          5          43.273852
    WA    2          34.707233
          4          35.511259
          6          29.781508
    

    The level='state' kwarg in div tells pandas to broadcast/join the dataframes based on the values in the state level of the index.
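
    To make the broadcasting concrete, here is a small sketch with hand-picked, hypothetical numbers (not the random data above), so the expected percentages are easy to verify by hand.

```python
import pandas as pd

# Hypothetical totals, already indexed by (state, office_id).
state_office = pd.DataFrame(
    {'sales': [100, 300, 50, 50]},
    index=pd.MultiIndex.from_tuples(
        [('CA', 1), ('CA', 2), ('WA', 1), ('WA', 2)],
        names=['state', 'office_id']))

# One row per state, indexed by 'state' only.
state = state_office.groupby(level='state').sum()

# level='state' matches each (state, office_id) row to its state's total.
pcts = state_office.div(state, level='state') * 100
print(pcts)
```

    Here CA totals 400 and WA totals 100, so each office's share is its sales divided by the matching state total.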

  • 2020-11-22 06:57
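    import numpy as np
    import pandas as pd

    # No seed is set here, so the numbers differ from the other answers.
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999)
                                 for _ in range(12)]})

    grouped = df.groupby(['state', 'office_id'])
    # Percentage of each state's sales contributed by each office
    100*grouped.sum()/df[["state","sales"]].groupby('state').sum()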
    

    Returns:

                         sales
    state office_id           
    AZ    2          54.587910
          4          33.009225
          6          12.402865
    CA    1          32.046582
          3          44.937684
          5          23.015735
    CO    1          21.099989
          3          31.848658
          5          47.051353
    WA    2          43.882790
          4          10.265275
          6          45.851935
    
  • 2020-11-22 07:01

    You can sum the whole DataFrame and divide by the state total:

    # Copying setup from Paul H answer
    import numpy as np
    import pandas as pd
    np.random.seed(0)
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
    # Add a column with the sales divided by state total sales.
    df['sales_ratio'] = (df / df.groupby(['state']).transform('sum'))['sales']
    
    df
    

    Returns

        office_id   sales state  sales_ratio
    0           1  405711    CA     0.193319
    1           2  535829    WA     0.347072
    2           3  217952    CO     0.198743
    3           4  252315    AZ     0.192500
    4           5  982371    CA     0.468094
    5           6  459783    WA     0.297815
    6           1  404137    CO     0.368519
    7           2  222579    AZ     0.169814
    8           3  710581    CA     0.338587
    9           4  548242    WA     0.355113
    10          5  474564    CO     0.432739
    11          6  835831    AZ     0.637686
    

    But note that this only works because all columns other than state are numeric, enabling summation of the entire DataFrame. For example, if office_id holds strings instead, you get an error:

    df.office_id = df.office_id.astype(str)
    df['sales_ratio'] = (df / df.groupby(['state']).transform('sum'))['sales']
    

    TypeError: unsupported operand type(s) for /: 'str' and 'str'
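
    One way around this error, sketched here as a variation rather than as part of the original answer, is to select the sales column before summing and dividing, so non-numeric columns never take part in the arithmetic. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with a string office_id column.
df = pd.DataFrame({'state': ['CA', 'CA', 'WA', 'WA'],
                   'office_id': ['a', 'b', 'a', 'b'],
                   'sales': [100, 300, 50, 150]})

# Only the numeric 'sales' column is summed and divided,
# so the string column is never involved in the division.
df['sales_ratio'] = df['sales'] / df.groupby('state')['sales'].transform('sum')
print(df)
```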
