Pandas percentage of total with groupby

没有蜡笔的小新 2020-11-22 06:41

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns: the State, the Office ID, and the Sales for that office.

14 answers
  •  南笙 (original poster)
    2020-11-22 06:55

    I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number of unique groups (probably because of the lambda). I built off of their answer to turn it into an array calculation, so now it's super fast! Below is the example code:

    Create the test dataframe with 50,000 unique groups:

    import random
    import string
    import pandas as pd
    import numpy as np
    np.random.seed(0)
    
    # This is the total number of groups to be created
    NumberOfGroups = 50000
    
    # Create a lot of groups (random strings of 4 letters)
    # Integer division (//) is required here: range() rejects floats in Python 3
    Group1     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//10)]*10
    Group2     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//2)]*2
    FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]
    
    # Make the numbers
    NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]
    
    # Make the dataframe
    df = pd.DataFrame({'Group 1': Group1,
                       'Group 2': Group2,
                       'Final Group': FinalGroup,
                       'Numbers I want as percents': NumbersForPercents})
    

    When grouped it looks like:

                                 Numbers I want as percents
    Group 1 Group 2 Final Group                            
    AAAH    AQYR    RMCH                                847
                    XDCL                                182
            DQGO    ALVF                                132
                    AVPH                                894
            OVGH    NVOO                                650
                    VKQP                                857
            VNLY    HYFW                                884
                    MOYH                                469
            XOOC    GIDS                                168
                    HTOY                                544
    AACE    HNXU    RAXK                                243
                    YZNK                                750
            NOYI    NYGC                                399
                    ZYCI                                614
            QKGK    CRLF                                520
                    UXNA                                970
            TXAR    MLNB                                356
                    NMFJ                                904
            VQYG    NPON                                504
                    QPKQ                                948
    ...
    [50000 rows x 1 columns]
    

    Array method of finding percentage:

    # Initial grouping (basically a sorted version of df)
    PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
    # Get the sum of values for each ("Group 1", "Group 2") pair, append "_Sum" to its column name, and turn it back into a dataframe (.reset_index)
    SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
    # Merge the two dataframes
    Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
    # Divide the two columns
    Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
    # Drop the extra _Sum column
    Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)
    

    This method takes about 0.15 seconds.
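The merge in the array method above exists only to line each row up with its group total. As a minimal sketch of an alternative (not the method used in this answer), `groupby(...).transform("sum")` returns those totals already aligned with the rows, so no merge or extra column is needed. The small frame below is hypothetical data standing in for the 50,000-row test set, and `value_col` and `group_totals` are names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for the large test dataframe
df = pd.DataFrame({
    "Group 1": ["A", "A", "A", "B"],
    "Group 2": ["X", "X", "Y", "Z"],
    "Final Group": ["p", "q", "r", "s"],
    "Numbers I want as percents": [25, 75, 50, 20],
})

value_col = "Numbers I want as percents"

# One row per (Group 1, Group 2, Final Group), keys kept as ordinary columns
grouped = df.groupby(["Group 1", "Group 2", "Final Group"], as_index=False)[value_col].sum()

# transform("sum") yields the (Group 1, Group 2) total repeated on every row of that group
group_totals = grouped.groupby(["Group 1", "Group 2"])[value_col].transform("sum")

# Element-wise division, no merge and no temporary _Sum column
grouped["Percent of Final Group"] = grouped[value_col] / group_totals * 100

print(grouped)
```

Whether this is faster or slower than the merge version on a given dataset is not something this sketch measures.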

    Top answer method (using lambda function):

    state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
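    # group_keys=False keeps the original index; float() on a Series is unnecessary in Python 3
    state_pcts = state_office.groupby(level=['Group 1','Group 2'], group_keys=False).apply(lambda x: 100 * x / x.sum())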
    

    This method takes about 21 seconds to produce the same result.
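Timings like the two quoted above depend on the machine and the pandas version, so it may be worth measuring them yourself. The following is a rough sketch of one way to do that with `time.perf_counter`. The dataframe is a smaller hypothetical one with a shortened column name (`Value`), and the helpers `percent_by_merge` and `percent_by_lambda` are illustrative names, not part of the original code.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical test data, much smaller than the 50,000 groups used above
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "Group 1": rng.integers(0, 20, n),
    "Group 2": rng.integers(0, 50, n),
    "Final Group": np.arange(n),
    "Value": rng.integers(100, 999, n),
})
keys = ["Group 1", "Group 2"]


def percent_by_merge(frame):
    """Merge-based ("array") approach: join group totals, then divide columns."""
    detail = frame.groupby(keys + ["Final Group"], as_index=False)["Value"].sum()
    totals = (frame.groupby(keys, as_index=False)["Value"].sum()
                   .rename(columns={"Value": "Value_Sum"}))
    merged = detail.merge(totals, on=keys).sort_values(keys + ["Final Group"])
    return (merged["Value"] / merged["Value_Sum"] * 100).to_numpy()


def percent_by_lambda(frame):
    """apply/lambda approach: one Python-level call per (Group 1, Group 2) pair."""
    detail = frame.groupby(keys + ["Final Group"])["Value"].sum()
    pct = detail.groupby(level=keys, group_keys=False).apply(lambda s: 100 * s / s.sum())
    return pct.reindex(detail.index).to_numpy()


for func in (percent_by_merge, percent_by_lambda):
    start = time.perf_counter()
    func(df)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: {elapsed:.4f} s")
```

A single run is noisy; repeating each call several times and taking the minimum gives a steadier comparison.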

    The result:

          Group 1 Group 2 Final Group  Numbers I want as percents  Percent of Final Group
    0        AAAH    AQYR        RMCH                         847               82.312925
    1        AAAH    AQYR        XDCL                         182               17.687075
    2        AAAH    DQGO        ALVF                         132               12.865497
    3        AAAH    DQGO        AVPH                         894               87.134503
    4        AAAH    OVGH        NVOO                         650               43.132050
    5        AAAH    OVGH        VKQP                         857               56.867950
    6        AAAH    VNLY        HYFW                         884               65.336290
    7        AAAH    VNLY        MOYH                         469               34.663710
    8        AAAH    XOOC        GIDS                         168               23.595506
    9        AAAH    XOOC        HTOY                         544               76.404494
    
