I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings.
There are a lot of questions to be answered here.
Let's start by understanding what a 'category' is...
Quoting from pandas docs for "Categorical Data":
Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.
There are two points I want to focus on here:
The definition of categoricals as a statistical variable:
Basically, this means we have to look at them from a statistical point of view, not the "regular" programming one; i.e. they are not 'enumerations'. Statistical categorical variables have specific operations and use cases; you can read more about them on Wikipedia.
I'll talk more about this after the second point.
Categories are levels in R:
We can understand more about categoricals if we read about R levels and factors.
I don't know much about R, but I found this source simple and sufficient. Quoting an interesting example from it:
When a factor is first created, all of its levels are stored along with the factor, and if subsets of the factor are extracted, they will retain all of the original levels. This can create problems when constructing model matrices and may or may not be useful when displaying the data using, say, the table function. As an example, consider a random sample from the letters vector, which is part of the base R distribution.

> lets = sample(letters,size=100,replace=TRUE)
> lets = factor(lets)
> table(lets[1:5])
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1

Even though only five of the levels were actually represented, the table function shows the frequencies for all of the levels of the original factors. To change this, we can simply use another call to factor:

> table(factor(lets[1:5]))
a k q s z
1 1 1 1 1
Basically, this tells us that displaying/using all the categories, even if they are not needed, is not that uncommon. Actually, it's the default behavior!
This is due to the usual use cases of categorical variables in statistics. In almost all cases you do care about all the categories, even if they are not used. Take, for example, the pandas function cut, sketched below.
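Here is a small, hedged sketch of that behavior (the bin edges are made up for illustration): cut keeps every bin as a category even when no value falls into it.

import pandas as pd

s = pd.Series([1, 2, 3, 95])
binned = pd.cut(s, bins=[0, 25, 50, 75, 100])
binned.value_counts(sort=False)
# (0, 25]      3
# (25, 50]     0
# (50, 75]     0
# (75, 100]    1
# the empty bins still appear, because they are categories of the result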
I hope that by this point you understand why this behavior exists in pandas.
As for why groupby considers all the combinations of categories: I can't say for sure, but my best guess, based on a quick review of the source code (and the GitHub issue you mentioned), is that they consider the groupby on categorical variables an interaction between them. Hence, it should consider all the pairs/tuples (like a Cartesian product). AFAIK, this helps a lot when you are trying to do something like ANOVA.
This also means that in this context you can't think of it in the usual SQL-like terminology.
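To make that concrete, here is a minimal sketch (the column names x, y and v are made up for illustration). With observed=False, the long-standing default being discussed here, a pairing that never occurs in the data is still produced:

import pandas as pd

demo = pd.DataFrame({
    "x": pd.Categorical(["a", "a", "b"], categories=["a", "b"]),
    "y": pd.Categorical(["c", "d", "c"], categories=["c", "d"]),
    "v": [1, 2, 3],
})
demo.groupby(["x", "y"], observed=False)["v"].mean()
# x  y
# a  c    1.0
#    d    2.0
# b  c    3.0
#    d    NaN   <- the never-observed pair ('b', 'd') is still produced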
Ok, but what if you don't want this behavior?
To the best of my knowledge, and taking into account that I spent the last night tracing this in the pandas source code, you can't "disable" it. It's hard-coded in every critical step.
However, because of the way groupby works, the actual "expanding" doesn't happen until it's needed; for example, when calling sum over the groups or trying to print them.
Hence, you can do any of the following to get only the needed groups:
df.groupby(group_cols).indices
#{('A', 'B', 'C'): array([0]),
# ('A', 'B', 'D'): array([1, 4]),
# ('B', 'A', 'A'): array([3]),
# ('B', 'A', 'C'): array([2])}
df.groupby(group_cols).groups
#{('A', 'B', 'C'): Int64Index([0], dtype='int64'),
# ('A', 'B', 'D'): Int64Index([1, 4], dtype='int64'),
# ('B', 'A', 'A'): Int64Index([3], dtype='int64'),
# ('B', 'A', 'C'): Int64Index([2], dtype='int64')}
# an example: iterate over only the observed groups
grt = df.groupby(group_cols)
for g in grt.groups:
    print(g, grt.get_group(g)['Value'].sum())
#('A', 'B', 'C') 54.34
#('A', 'B', 'D') 826.74
#('B', 'A', 'A') 765.4
#('B', 'A', 'C') 514.5
I know this is a no-go for you, but I'm 99% sure that there is no direct way to do this.
I agree that there should be a boolean variable to disable this behavior and use the "regular" SQL-like one.
I found the behavior similar to what's documented in the operations section of Categorical Data.
In particular, similar to
In [121]: cats2 = pd.Categorical(["a","a","b","b"], categories=["a","b","c"])

In [122]: df2 = pd.DataFrame({"cats":cats2,"B":["c","d","c","d"], "values":[1,2,3,4]})

In [123]: df2.groupby(["cats","B"]).mean()
Out[123]:
        values
cats B
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
That section also has some words describing the related behavior for Series and groupby, and there is a pivot table example at the end of the section.
Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with categorical data:
Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data:
Groupby will also show “unused” categories:
The words and the example are cited from Categorical Data.
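As a small illustration of the value_counts point above (my own sketch, not quoted from the docs):

import pandas as pd

s = pd.Series(pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]))
s.value_counts()
# a    2
# b    1
# c    0   <- 'c' never occurs in the data but is still reported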
I found this post while debugging something similar. Very good post, and I really like the inclusion of boundary conditions!
Here's the code that accomplishes the initial goal:
r = df.groupby(group_cols, as_index=False).agg({'Value': 'sum'})
# flatten the column index in case the aggregation produces a MultiIndex (e.g. with multiple statistics)
r.columns = ['_'.join(col).strip('_') if isinstance(col, tuple) else col for col in r.columns]
The downside of this solution is that it can result in a hierarchical column index that you may want to flatten (especially if you have multiple statistics). I included flattening of the column index in the code above.
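For example, with several statistics per column the flattening gives names like Value_sum and Value_mean. This is a hedged sketch of that case (it uses reset_index instead of as_index=False; the exact column names depend on the functions you pass):

r = df.groupby(group_cols).agg({'Value': ['sum', 'mean']})
r.columns = ['_'.join(col).strip('_') for col in r.columns]
r = r.reset_index()
# r.columns is now: ['Group1', 'Group2', 'Group3', 'Value_sum', 'Value_mean']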
I don't know why the instance methods:
df.groupby(group_cols).sum()
df.groupby(group_cols).mean()
df.groupby(group_cols).std()
use all unique combinations of categorical variables, while the .agg() method:
df.groupby(group_cols).agg(['count', 'sum', 'mean', 'std'])
ignores the unused level combinations of the groups. That seems inconsistent. I'm just happy that we can use the .agg() method and not have to worry about a Cartesian combination explosion.
Also, I think it is very common to have a much lower count of unique combinations than the Cartesian product. Think of all the cases where data has columns like "State", "County", "Zip"... these are all nested variables, and many data sets out there have variables with a high degree of nesting.
In our case the difference between Cartesian product of the grouping variables and the naturally occurring combinations is over 1000x (and the starting data set is over 1,000,000 rows).
Consequently, I would have voted for making observed=True the default behavior.
Since pandas 0.23.0, the groupby method can take a parameter observed which fixes this issue if it is set to True (False by default). Below is the exact same code as in the question, with just observed=True added:
import pandas as pd
group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))
for col in group_cols:
    df[col] = df[col].astype('category')
df.groupby(group_cols, as_index=False, observed=True).sum()
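Run on the sample data above, this should give only the four observed combinations, roughly:

#   Group1 Group2 Group3   Value
# 0      A      B      C   54.34
# 1      A      B      D  826.74
# 2      B      A      A  765.40
# 3      B      A      C  514.50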
I was able to get a solution that should work really well. I'll edit my post with a better explanation. But in the meantime, does this work well for you?
import pandas as pd
group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))
for col in group_cols:
    df[col] = df[col].astype('category')
# group by the underlying integer codes; select 'Value' so only the numeric column is aggregated
result = df.groupby([df[col].values.codes for col in group_cols])[['Value']].sum()
result = result.reset_index()
# grouping by plain arrays names the index levels level_0, level_1, ... - map them back to the column names
level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
result = result.rename(columns=level_to_column_name)
for col in group_cols:
    # turn the integer codes back into proper categorical columns
    result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
result
So the answer to this felt more like a proper programming question than a normal pandas question. Under the hood, all categorical series are just a bunch of integers that index into a list of categories. I did a groupby on these underlying integers because they don't have the same problem as categorical columns. After doing this I had to rename the columns. I then used the from_codes constructor to efficiently turn the integer codes back into a categorical column (see the small sketch after the result below).
  Group1 Group2 Group3   Value
0      A      B      C   54.34
1      A      B      D  826.74
2      B      A      A  765.40
3      B      A      C  514.50
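To make the "bunch of integers" point concrete, here is a tiny hedged sketch of the codes-to-categories round trip the solution relies on (values made up):

c = pd.Categorical(["A", "B", "A"], categories=["A", "B", "C"])
c.codes                                                        # array([0, 1, 0], dtype=int8) - integers pointing into c.categories
pd.Categorical.from_codes(c.codes, categories=c.categories)    # reconstructs the original Categorical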
So I understand that this isn't exactly the answer you were after, but I've made my solution into a little function for people that have this problem in the future.
def categorical_groupby(df, group_cols, agg_function="sum"):
    "Does a groupby on a number of categorical columns"
    # aggregate only the non-grouping columns, grouping by the underlying integer codes
    agg_cols = [c for c in df.columns if c not in group_cols]
    result = df.groupby([df[col].values.codes for col in group_cols])[agg_cols].agg(agg_function)
    result = result.reset_index()
    level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
    result = result.rename(columns=level_to_column_name)
    for col in group_cols:
        result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
    return result
call it like this:
df.pipe(categorical_groupby, group_cols)