This is obviously simple, but as a numpy newbe I\'m getting stuck.
I have a CSV file that contains 3 columns, the State, the Office ID, and the Sales for that office
I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number unique groups (probably because of the lambda). I built off of their answer to turn it into an array calculation so now it's super fast! Below is the example code:
Create the test dataframe with 50,000 unique groups
import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)
# This is the total number of groups to be created
NumberOfGroups = 50000
# Create a lot of groups (random strings of 4 letters)
Group1 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/10)]*10
Group2 = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups/2)]*2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]
# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]
# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
'Group 2': Group2,
'Final Group': FinalGroup,
'Numbers I want as percents': NumbersForPercents})
When grouped it looks like:
Numbers I want as percents
Group 1 Group 2 Final Group
AAAH AQYR RMCH 847
XDCL 182
DQGO ALVF 132
AVPH 894
OVGH NVOO 650
VKQP 857
VNLY HYFW 884
MOYH 469
XOOC GIDS 168
HTOY 544
AACE HNXU RAXK 243
YZNK 750
NOYI NYGC 399
ZYCI 614
QKGK CRLF 520
UXNA 970
TXAR MLNB 356
NMFJ 904
VQYG NPON 504
QPKQ 948
...
[50000 rows x 1 columns]
Array method of finding percentage:
# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values for the "final group", append "_Sum" to it's column name, and change it into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)
This method takes about ~0.15 seconds
Top answer method (using lambda function):
state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))
This method takes about ~21 seconds to produce the same result.
The result:
Group 1 Group 2 Final Group Numbers I want as percents Percent of Final Group
0 AAAH AQYR RMCH 847 82.312925
1 AAAH AQYR XDCL 182 17.687075
2 AAAH DQGO ALVF 132 12.865497
3 AAAH DQGO AVPH 894 87.134503
4 AAAH OVGH NVOO 650 43.132050
5 AAAH OVGH VKQP 857 56.867950
6 AAAH VNLY HYFW 884 65.336290
7 AAAH VNLY MOYH 469 34.663710
8 AAAH XOOC GIDS 168 23.595506
9 AAAH XOOC HTOY 544 76.404494