Pandas percentage of total with groupby

Anonymous (unverified), submitted 2019-12-03 02:11:02

Question:

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns, the State, the Office ID, and the Sales for that office.

I want to calculate the percentage of sales per office in a given state (total of all percentages in each state is 100%).

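    import numpy as np
    import pandas as pd

    # list(...) is needed on Python 3, where a range object cannot be multiplied
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999)
                                 for _ in range(12)]})

    df.groupby(['state', 'office_id']).agg({'sales': 'sum'})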

This returns:

                      sales
    state office_id
    AZ    2          839507
          4          373917
          6          347225
    CA    1          798585
          3          890850
          5          454423
    CO    1          819975
          3          202969
          5          614011
    WA    2          163942
          4          369858
          6          959285

I can't seem to figure out how to "reach up" to the state level of the groupby to total up the sales for the entire state to calculate the fraction.

Answer 1:

Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way: group state_office by state (index level 0) and divide the sales column by its sum. Copying the beginning of Paul H's answer:

    # From Paul H
    import numpy as np
    import pandas as pd
    np.random.seed(0)
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999)
                                 for _ in range(12)]})
    state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
    # Change: groupby state_office and divide by sum
    state_pcts = state_office.groupby(level=0).apply(lambda x:
                                                     100 * x / float(x.sum()))

Returns:

                         sales
    state office_id
    AZ    2          16.981365
          4          19.250033
          6          63.768601
    CA    1          19.331879
          3          33.858747
          5          46.809373
    CO    1          36.851857
          3          19.874290
          5          43.273852
    WA    2          34.707233
          4          35.511259
          6          29.781508


Answer 2:

You need to make a second groupby object that groups by the states, and then use the div method:

    import numpy as np
    import pandas as pd
    np.random.seed(0)
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

    state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
    state = df.groupby(['state']).agg({'sales': 'sum'})
    state_office.div(state, level='state') * 100

                         sales
    state office_id
    AZ    2          16.981365
          4          19.250033
          6          63.768601
    CA    1          19.331879
          3          33.858747
          5          46.809373
    CO    1          36.851857
          3          19.874290
          5          43.273852
    WA    2          34.707233
          4          35.511259
          6          29.781508

The level='state' kwarg in div tells pandas to broadcast/join the DataFrames based on the values in the state level of the index.
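To make the broadcasting easier to see, here is a minimal sketch on a toy frame. The data is hypothetical (it is not the random data from the answer) and was chosen so the percentages can be checked by hand.

```python
import pandas as pd

# Hypothetical toy data: two states with two offices each.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 2],
    "sales": [100, 300, 50, 150],
})

# Per-office totals, indexed by (state, office_id).
state_office = df.groupby(["state", "office_id"]).agg({"sales": "sum"})
# Per-state totals, indexed by state only.
state = df.groupby("state").agg({"sales": "sum"})

# level="state" matches every (state, office_id) row to its state's total.
pcts = state_office.div(state, level="state") * 100
print(pcts)
```

Each state's rows should add up to 100, since every office total is divided by the total of its own state.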



Answer 3:

I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number of unique groups (probably because of the lambda). I built off of their answer to turn it into an array calculation so now it's super fast! Below is the example code:

Create the test dataframe with 50,000 unique groups

    import random
    import string
    import pandas as pd
    import numpy as np
    np.random.seed(0)

    # This is the total number of groups to be created
    NumberOfGroups = 50000

    # Create a lot of groups (random strings of 4 letters)
    # Integer division (//) is needed on Python 3, where range() rejects floats
    Group1     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//10)]*10
    Group2     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//2)]*2
    FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]

    # Make the numbers
    NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]

    # Make the dataframe
    df = pd.DataFrame({'Group 1': Group1,
                       'Group 2': Group2,
                       'Final Group': FinalGroup,
                       'Numbers I want as percents': NumbersForPercents})

When grouped it looks like:

                                 Numbers I want as percents
    Group 1 Group 2 Final Group
    AAAH    AQYR    RMCH                                847
                    XDCL                                182
            DQGO    ALVF                                132
                    AVPH                                894
            OVGH    NVOO                                650
                    VKQP                                857
            VNLY    HYFW                                884
                    MOYH                                469
            XOOC    GIDS                                168
                    HTOY                                544
    AACE    HNXU    RAXK                                243
                    YZNK                                750
            NOYI    NYGC                                399
                    ZYCI                                614
            QKGK    CRLF                                520
                    UXNA                                970
            TXAR    MLNB                                356
                    NMFJ                                904
            VQYG    NPON                                504
                    QPKQ                                948
    ...
    [50000 rows x 1 columns]

Array method of finding percentage:

    # Initial grouping (basically a sorted version of df)
    PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
    # Get the sum of values for the "final group", append "_Sum" to its column name, and change it into a dataframe (.reset_index)
    SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
    # Merge the two dataframes
    Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
    # Divide the two columns
    Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
    # Drop the extra _Sum column
    Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)

This method takes about 0.15 seconds.

Top answer method (using lambda function):

    state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
    state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))

This method takes about 21 seconds to produce the same result.

The result:

          Group 1 Group 2 Final Group  Numbers I want as percents  Percent of Final Group
    0        AAAH    AQYR        RMCH                         847               82.312925
    1        AAAH    AQYR        XDCL                         182               17.687075
    2        AAAH    DQGO        ALVF                         132               12.865497
    3        AAAH    DQGO        AVPH                         894               87.134503
    4        AAAH    OVGH        NVOO                         650               43.132050
    5        AAAH    OVGH        VKQP                         857               56.867950
    6        AAAH    VNLY        HYFW                         884               65.336290
    7        AAAH    VNLY        MOYH                         469               34.663710
    8        AAAH    XOOC        GIDS                         168               23.595506
    9        AAAH    XOOC        HTOY                         544               76.404494
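The merge steps above can be wrapped in a small function. This is only a sketch: the helper name percent_within_group, its parameters, and the toy frame are hypothetical and do not come from the answer. Unlike the answer's code, it names the merge keys explicitly with on= instead of relying on pd.merge's default of joining on all shared columns.

```python
import pandas as pd

def percent_within_group(df, outer_cols, inner_col, value_col):
    """Hypothetical helper: percent of value_col within each outer_cols group."""
    # Sum per full group (outer + inner columns), returned as a flat frame.
    detail = df.groupby(outer_cols + [inner_col], as_index=False)[value_col].sum()
    # Sum per outer group, stored under a separate "_Sum" column name.
    totals = (df.groupby(outer_cols, as_index=False)[value_col].sum()
                .rename(columns={value_col: value_col + "_Sum"}))
    # Attach each outer-group total to its detail rows.
    merged = detail.merge(totals, on=outer_cols)
    merged["Percent"] = merged[value_col] / merged[value_col + "_Sum"] * 100
    return merged.drop(columns=value_col + "_Sum")

# Hypothetical toy data, small enough to check by hand.
toy = pd.DataFrame({
    "Group 1": ["A", "A", "A", "B"],
    "Group 2": ["x", "x", "y", "x"],
    "Final Group": ["p", "q", "r", "s"],
    "n": [1, 3, 5, 7],
})
result = percent_within_group(toy, ["Group 1", "Group 2"], "Final Group", "n")
print(result)
```

The division is a single vectorized column operation, which is presumably why this approach avoids the per-group overhead of the lambda version.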


Answer 4:

You can sum the whole DataFrame and divide by the state total:

    # Copying setup from Paul H answer
    import numpy as np
    import pandas as pd
    np.random.seed(0)
    df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                       'office_id': list(range(1, 7)) * 2,
                       'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
    # Add a column with the sales divided by state total sales.
    df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

    df

Returns

        office_id   sales state  sales_ratio
    0           1  405711    CA     0.193319
    1           2  535829    WA     0.347072
    2           3  217952    CO     0.198743
    3           4  252315    AZ     0.192500
    4           5  982371    CA     0.468094
    5           6  459783    WA     0.297815
    6           1  404137    CO     0.368519
    7           2  222579    AZ     0.169814
    8           3  710581    CA     0.338587
    9           4  548242    WA     0.355113
    10          5  474564    CO     0.432739
    11          6  835831    AZ     0.637686

But note that this only works because all columns other than state are numeric, enabling summation of the entire DataFrame. For example, if office_id is character instead, you get an error:

    df.office_id = df.office_id.astype(str)
    df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

TypeError: unsupported operand type(s) for /: 'str' and 'str'
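One way to sidestep this error is to select the sales column before transforming, so the non-numeric columns are never summed or divided. This is a variation on the answer's code, not the answer itself, and the toy frame is hypothetical, with office_id deliberately stored as strings.

```python
import pandas as pd

# Hypothetical toy data; office_id is a string column on purpose.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": ["a", "b", "a", "b"],
    "sales": [100, 300, 50, 150],
})

# transform("sum") returns one state total per original row, aligned with
# df's index, so the division is a plain element-wise Series operation.
df["sales_ratio"] = df["sales"] / df.groupby("state")["sales"].transform("sum")
print(df)
```

Because only the sales column takes part in the arithmetic, the dtypes of the other columns no longer matter.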



Answer 5:

For conciseness I'd use the SeriesGroupBy:

    In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")

    In [12]: c
    Out[12]:
    state  office_id
    AZ     2            925105
           4            592852
           6            362198
    CA     1            819164
           3            743055
           5            292885
    CO     1            525994
           3            338378
           5            490335
    WA     2            623380
           4            441560
           6            451428
    Name: count, dtype: int64

    In [13]: c / c.groupby(level=0).sum()
    Out[13]:
    state  office_id
    AZ     2            0.492037
           4            0.315321
           6            0.192643
    CA     1            0.441573
           3            0.400546
           5            0.157881
    CO     1            0.388271
           3            0.249779
           5            0.361949
    WA     2            0.411101
           4            0.291196
           6            0.297703
    Name: count, dtype: float64

For multiple groups you have to use transform (using Radical's df):

    In [21]: c =  df.groupby(["Group 1","Group 2","Final Group"])["Numbers I want as percents"].sum().rename("count")

    In [22]: c / c.groupby(level=[0, 1]).transform("sum")
    Out[22]:
    Group 1  Group 2  Final Group
    AAHQ     BOSC     OWON           0.331006
                      TLAM           0.668994
             MQVF     BWSI           0.288961
                      FXZM           0.711039
             ODWV     NFCH           0.262395
    ...
    Name: count, dtype: float64

This seems to be slightly more performant than the other answers (just less than twice the speed of Radical's answer, for me ~0.08s).
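As a sanity check on the transform version, the sketch below runs it on a hypothetical toy frame (not Radical's df) and confirms that the shares inside each (Group 1, Group 2) block add up to 1. transform returns a result with the same index as c, so the division lines up row by row.

```python
import pandas as pd

# Hypothetical toy data with three grouping levels.
toy = pd.DataFrame({
    "Group 1": ["A", "A", "A", "B"],
    "Group 2": ["x", "x", "y", "x"],
    "Final Group": ["p", "q", "r", "s"],
    "n": [1, 3, 5, 7],
})

# Totals per full group, as a Series with a three-level index.
c = toy.groupby(["Group 1", "Group 2", "Final Group"])["n"].sum()

# Totals per (Group 1, Group 2), repeated for every row of c.
share = c / c.groupby(level=[0, 1]).transform("sum")

# Each (Group 1, Group 2) block of shares should sum to 1.
check = share.groupby(level=[0, 1]).sum()
print(share)
print(check)
```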


