可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
This is obviously simple, but as a numpy newbe I'm getting stuck.
I have a CSV file that contains 3 columns, the State, the Office ID, and the Sales for that office.
I want to calculate the percentage of sales per office in a given state (total of all percentages in each state is 100%).
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3, 'office_id': range(1, 7) * 2, 'sales': [np.random.randint(100000, 999999) for _ in range(12)]}) df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
This returns:
sales state office_id AZ 2 839507 4 373917 6 347225 CA 1 798585 3 890850 5 454423 CO 1 819975 3 202969 5 614011 WA 2 163942 4 369858 6 959285
I can't seem to figure out how to "reach up" to the state
level of the groupby
to total up the sales
for the entire state
to calculate the fraction.
回答1:
Paul H's answer is right that you will have to make a second groupby
object, but you can calculate the percentage in a simpler way -- just groupby
the state_office
and divide the sales
column by its sum. Copying the beginning of Paul H's answer:
# From Paul H import numpy as np import pandas as pd np.random.seed(0) df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3, 'office_id': list(range(1, 7)) * 2, 'sales': [np.random.randint(100000, 999999) for _ in range(12)]}) state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'}) # Change: groupby state_office and divide by sum state_pcts = state_office.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
Returns:
sales state office_id AZ 2 16.981365 4 19.250033 6 63.768601 CA 1 19.331879 3 33.858747 5 46.809373 CO 1 36.851857 3 19.874290 5 43.273852 WA 2 34.707233 4 35.511259 6 29.781508
回答2:
You need to make a second groupby object that groups by the states, and then use the div
method:
import numpy as np import pandas as pd np.random.seed(0) df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3, 'office_id': list(range(1, 7)) * 2, 'sales': [np.random.randint(100000, 999999) for _ in range(12)]}) state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'}) state = df.groupby(['state']).agg({'sales': 'sum'}) state_office.div(state, level='state') * 100 sales state office_id AZ 2 16.981365 4 19.250033 6 63.768601 CA 1 19.331879 3 33.858747 5 46.809373 CO 1 36.851857 3 19.874290 5 43.273852 WA 2 34.707233 4 35.511259 6 29.781508
the level='state'
kwarg in div
tells pandas to broadcast/join the dataframes base on the values in the state
level of the index.
回答3:
I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number unique groups (probably because of the lambda). I built off of their answer to turn it into an array calculation so now it's super fast! Below is the example code:
Create the test dataframe with 50,000 unique groups
import random import string import pandas as pd import numpy as np np.random.seed(0) # This is the total number of groups to be created NumberOfGroups = 50000 # Create a lot of groups (random strings of 4 letters) Group1 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/10)]*10 Group2 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/2)]*2 FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)] # Make the numbers NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)] # Make the dataframe df = pd.DataFrame({'Group 1': Group1, 'Group 2': Group2, 'Final Group': FinalGroup, 'Numbers I want as percents': NumbersForPercents})
When grouped it looks like:
Numbers I want as percents Group 1 Group 2 Final Group AAAH AQYR RMCH 847 XDCL 182 DQGO ALVF 132 AVPH 894 OVGH NVOO 650 VKQP 857 VNLY HYFW 884 MOYH 469 XOOC GIDS 168 HTOY 544 AACE HNXU RAXK 243 YZNK 750 NOYI NYGC 399 ZYCI 614 QKGK CRLF 520 UXNA 970 TXAR MLNB 356 NMFJ 904 VQYG NPON 504 QPKQ 948 ... [50000 rows x 1 columns]
Array method of finding percentage:
# Initial grouping (basically a sorted version of df) PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index() # Get the sum of values for the "final group", append "_Sum" to it's column name, and change it into a dataframe (.reset_index) SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index() # Merge the two dataframes Percents_df = pd.merge(PreGroupby_df, SumGroup_df) # Divide the two columns Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100 # Drop the extra _Sum column Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)
This method takes about ~0.15 seconds
Top answer method (using lambda function):
state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'}) state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))
This method takes about ~21 seconds to produce the same result.
The result:
Group 1 Group 2 Final Group Numbers I want as percents Percent of Final Group 0 AAAH AQYR RMCH 847 82.312925 1 AAAH AQYR XDCL 182 17.687075 2 AAAH DQGO ALVF 132 12.865497 3 AAAH DQGO AVPH 894 87.134503 4 AAAH OVGH NVOO 650 43.132050 5 AAAH OVGH VKQP 857 56.867950 6 AAAH VNLY HYFW 884 65.336290 7 AAAH VNLY MOYH 469 34.663710 8 AAAH XOOC GIDS 168 23.595506 9 AAAH XOOC HTOY 544 76.404494
回答4:
You can sum
the whole DataFrame
and divide by the state
total:
# Copying setup from Paul H answer import numpy as np import pandas as pd np.random.seed(0) df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3, 'office_id': list(range(1, 7)) * 2, 'sales': [np.random.randint(100000, 999999) for _ in range(12)]}) # Add a column with the sales divided by state total sales. df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales'] df
Returns
office_id sales state sales_ratio 0 1 405711 CA 0.193319 1 2 535829 WA 0.347072 2 3 217952 CO 0.198743 3 4 252315 AZ 0.192500 4 5 982371 CA 0.468094 5 6 459783 WA 0.297815 6 1 404137 CO 0.368519 7 2 222579 AZ 0.169814 8 3 710581 CA 0.338587 9 4 548242 WA 0.355113 10 5 474564 CO 0.432739 11 6 835831 AZ 0.637686
But note that this only works because all columns other than state
are numeric, enabling summation of the entire DataFrame. For example, if office_id
is character instead, you get an error:
df.office_id = df.office_id.astype(str) df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']
TypeError: unsupported operand type(s) for /: 'str' and 'str'
回答5:
For conciseness I'd use the SeriesGroupBy:
In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count") In [12]: c Out[12]: state office_id AZ 2 925105 4 592852 6 362198 CA 1 819164 3 743055 5 292885 CO 1 525994 3 338378 5 490335 WA 2 623380 4 441560 6 451428 Name: count, dtype: int64 In [13]: c / c.groupby(level=0).sum() Out[13]: state office_id AZ 2 0.492037 4 0.315321 6 0.192643 CA 1 0.441573 3 0.400546 5 0.157881 CO 1 0.388271 3 0.249779 5 0.361949 WA 2 0.411101 4 0.291196 6 0.297703 Name: count, dtype: float64
For multiple groups you have to use transform (using Radical's df):
In [21]: c = df.groupby(["Group 1","Group 2","Final Group"])["Numbers I want as percents"].sum().rename("count") In [22]: c / c.groupby(level=[0, 1]).transform("sum") Out[22]: Group 1 Group 2 Final Group AAHQ BOSC OWON 0.331006 TLAM 0.668994 MQVF BWSI 0.288961 FXZM 0.711039 ODWV NFCH 0.262395 ... Name: count, dtype: float64
This seems to be slightly more performant than the other answers (just less than twice the speed of Radical's answer, for me ~0.08s).