Pandas groupby with categories with redundant nan

说谎 2020-12-05 02:26

I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. […]

5 Answers
  • 2020-12-05 03:05

    There are a lot of questions to be answered here.
    Let's start by understanding what a 'category' is...

    Definition of Categorical dtype

    Quoting from pandas docs for "Categorical Data":

    Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.

    There are two points I want to focus on here:

    1. The definition of categoricals as a statistical variable:
      basically, this means we have to look at them from a statistical point of view, not the "regular" programming one, i.e. they are not 'enums'. Statistical categorical variables have specific operations and use cases; you can read more about them on Wikipedia.
      I'll talk more about this after the second point.

    2. Categories are levels in R:
      We can understand more about categoricals if we read about R levels and factors.
      I don't know much about R, but I found this source simple and sufficient. Quoting an interesting example from it:

      When a factor is first created, all of its levels are stored along with the factor, and if subsets of the factor are extracted, they will retain all of the original levels. This can create problems when constructing model matrices and may or may not be useful when displaying the data using, say, the table function. As an example, consider a random sample from the letters vector, which is part of the base R distribution.
      
      > lets = sample(letters,size=100,replace=TRUE)
      > lets = factor(lets)
      > table(lets[1:5])
      
      a b c d e f g h i j k l m n o p q r s t u v w x y z
      1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1
      
      Even though only five of the levels were actually represented, the table function shows the frequencies for all of the levels of the original factors. To change this, we can simply use another call to factor
      
      > table(factor(lets[1:5]))
      
      a k q s z
      1 1 1 1 1
      

    Basically this tells us that displaying/using all the categories, even when they are not needed, is not that uncommon. Actually, it's the default behavior!
    This is due to the usual use cases of categorical variables in statistics. In almost all cases you do care about all the categories even if they are unused. Take, for example, the pandas function cut.
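
    As a minimal pandas analogue of the R example above (assuming nothing beyond pandas itself), cut produces a categorical with one level per bin, and empty bins still show up with a count of 0:

    import pandas as pd

    # cut() returns a categorical Series with one category per bin;
    # bins with no observations are still reported, with a count of 0
    ages = pd.Series([5, 7, 31, 33])
    binned = pd.cut(ages, bins=[0, 10, 20, 30, 40])
    print(binned.value_counts(sort=False))
    # (0, 10]     2
    # (10, 20]    0
    # (20, 30]    0
    # (30, 40]    2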

    I hope by this point that you understood why this behavior exists in pandas.

    GroupBy on Categorical Variables

    As for why groupby considers all the combinations of categories: I can't say for sure, but my best guess, based on a quick review of the source code (and the GitHub issue you mentioned), is that they consider a groupby on categorical variables an interaction between them. Hence, it should consider all the pairs/tuples (like a Cartesian product). AFAIK, this helps a lot when you are trying to do something like ANOVA.
    This also means that in this context you can't think of it in the usual SQL-like terminology.
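
    To make the Cartesian product point concrete, here is a small sketch (the column names x, y, v are made up for illustration): two categorical columns with three levels each produce 3 × 3 = 9 groups, even though only two combinations occur in the data:

    import pandas as pd

    df_small = pd.DataFrame({
        "x": pd.Categorical(["a", "b"], categories=["a", "b", "c"]),
        "y": pd.Categorical(["p", "q"], categories=["p", "q", "r"]),
        "v": [1, 2],
    })
    # observed=False (the historical default) expands to all 9 pairs
    print(len(df_small.groupby(["x", "y"], observed=False).sum()))  # 9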

    Solutions?

    Ok, but what if you don't want this behavior?
    To the best of my knowledge, and taking into account that I spent the last night tracing this in the pandas source code, you can't "disable" it. It's hard-coded in every critical step.
    However, because of the way groupby works, the actual "expanding" doesn't happen until it's needed, for example, when calling sum over the groups or when trying to print them.
    Hence, you can do any of the following to get only the needed groups:

    df.groupby(group_cols).indices
    #{('A', 'B', 'C'): array([0]),
    # ('A', 'B', 'D'): array([1, 4]),
    # ('B', 'A', 'A'): array([3]),
    # ('B', 'A', 'C'): array([2])}
    
    df.groupby(group_cols).groups
    #{('A', 'B', 'C'): Int64Index([0], dtype='int64'),
    # ('A', 'B', 'D'): Int64Index([1, 4], dtype='int64'),
    # ('B', 'A', 'A'): Int64Index([3], dtype='int64'),
    # ('B', 'A', 'C'): Int64Index([2], dtype='int64')}
    
    # an example
    gb = df.groupby(group_cols)
    for g in gb.groups:
        print(g, gb.get_group(g)['Value'].sum())
    #('A', 'B', 'C') 54.34
    #('A', 'B', 'D') 826.74
    #('B', 'A', 'A') 765.4
    #('B', 'A', 'C') 514.5
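
    Building on the same idea, here is a sketch that assembles only the observed groups into a result without ever materializing the full product (df, group_cols and the 'Value' column as in the question):

    gb = df.groupby(group_cols)
    # .indices holds positional arrays for the observed groups only
    sums = pd.Series({key: df['Value'].iloc[idx].sum()
                      for key, idx in gb.indices.items()})
    #(A, B, C)     54.34
    #(A, B, D)    826.74
    #(B, A, A)    765.40
    #(B, A, C)    514.50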
    

    I know this is a no-go for you, but I'm 99% sure that there is no direct way to do this.
    I agree that there should be a boolean variable to disable this behavior and use the "regular" SQL-like one.

  • 2020-12-05 03:06

    I found the behavior similar to what's documented in the operations section of Categorical Data.

    In particular, similar to

    In [121]: cats2 = pd.Categorical(["a","a","b","b"], categories=["a","b","c"])
    
    In [122]: df2 = pd.DataFrame({"cats":cats2,"B":["c","d","c","d"], "values":[1,2,3,4]})
    
    In [123]: df2.groupby(["cats","B"]).mean()
    Out[123]: 
            values
    cats B        
    a    c     1.0
         d     2.0
    b    c     3.0
         d     4.0
    c    c     NaN
         d     NaN
    

    The documentation has some more words describing the related behavior in Series and groupby, and there is also a pivot table example at the end of that section (a small sketch of it follows after the quotes below).

    Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with categorical data:

    Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data:

    Groupby will also show “unused” categories:

    The words and the example are cited from Categorical Data.
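
    For completeness, here is a small sketch of the pivot table behavior the docs mention, reusing cats2 and df2 from above; the unused category "c" shows up as an all-NaN row here too (dropna=False keeps the all-NaN rows explicitly):

    df2.pivot_table(values="values", index="cats", columns="B",
                    observed=False, dropna=False)
    #B          c    d
    #cats
    #a        1.0  2.0
    #b        3.0  4.0
    #c        NaN  NaN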

  • 2020-12-05 03:27

    I found this post while debugging something similar. Very good post, and I really like the inclusion of boundary conditions!

    Here's the code that accomplishes the initial goal:

    r = df.groupby(group_cols, as_index=False).agg({'Value': 'sum'})

    # flatten the columns in case the aggregation produced a MultiIndex
    r.columns = ['_'.join(col).strip('_') if isinstance(col, tuple) else col
                 for col in r.columns]
    

    The downside of this solution is that, with multiple statistics, it results in a hierarchical column index that you may want to flatten. I included the flattening of the column index in the code above.
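
    For illustration, a sketch with multiple statistics (df and group_cols as in the question), where the columns really do come back as a MultiIndex of (column, statistic) tuples and the same flattening applies:

    r = df.groupby(group_cols, as_index=False).agg({'Value': ['sum', 'mean']})
    # the group-key columns come back as ('Group1', ''), hence the strip('_')
    r.columns = ['_'.join(col).strip('_') for col in r.columns]
    # -> Group1, Group2, Group3, Value_sum, Value_mean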

    I don't know why instance methods:

    df.groupby(group_cols).sum()
    df.groupby(group_cols).mean()
    df.groupby(group_cols).std()
    

    use all unique combinations of categorical variables, while the .agg() method:

    df.groupby(group_cols).agg(['count', 'sum', 'mean', 'std']) 
    

    ignores the unused level combinations of the groups. That seems inconsistent. I'm just happy that we can use the .agg() method and not have to worry about a Cartesian combination explosion.

    Also, I think it is very common to have a much lower unique cardinality count than the Cartesian product. Think of all the cases where data has columns like "State", "County", "Zip"... these are all nested variables, and many data sets out there have variables with a high degree of nesting.

    In our case the difference between Cartesian product of the grouping variables and the naturally occurring combinations is over 1000x (and the starting data set is over 1,000,000 rows).

    Consequently, I would have voted for making observed=True the default behavior.

  • 2020-12-05 03:30

    Since Pandas 0.23.0, the groupby method can take a parameter observed which fixes this issue if it is set to True (it is False by default). Below is the exact same code as in the question, with just observed=True added (expected output shown after the code):

    import pandas as pd
    
    group_cols = ['Group1', 'Group2', 'Group3']
    
    df = pd.DataFrame([['A', 'B', 'C', 54.34],
                       ['A', 'B', 'D', 61.34],
                       ['B', 'A', 'C', 514.5],
                       ['B', 'A', 'A', 765.4],
                       ['A', 'B', 'D', 765.4]],
                      columns=(group_cols+['Value']))
    
    for col in group_cols:
        df[col] = df[col].astype('category')
    
    df.groupby(group_cols, as_index=False, observed=True).sum()
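
    With the sample data above, the result should contain only the four observed combinations:

      Group1 Group2 Group3   Value
    0      A      B      C   54.34
    1      A      B      D  826.74
    2      B      A      A  765.40
    3      B      A      C  514.50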
    

  • 2020-12-05 03:30

    I was able to get a solution that should work really well. I'll edit my post with a better explanation. But in the meantime, does this work well for you?

    import pandas as pd
    
    group_cols = ['Group1', 'Group2', 'Group3']
    
    df = pd.DataFrame([['A', 'B', 'C', 54.34],
                       ['A', 'B', 'D', 61.34],
                       ['B', 'A', 'C', 514.5],
                       ['B', 'A', 'A', 765.4],
                       ['A', 'B', 'D', 765.4]],
                      columns=(group_cols+['Value']))
    for col in group_cols:
        df[col] = df[col].astype('category')
    
    
    # group by the underlying integer codes rather than the categorical columns
    result = df.groupby([df[col].values.codes for col in group_cols]).sum()
    result = result.reset_index()
    # the unnamed code arrays come back as level_0, level_1, ...; rename them
    level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
    result = result.rename(columns=level_to_column_name)
    # turn the integer codes back into categorical columns
    for col in group_cols:
        result[col] = pd.Categorical.from_codes(result[col].values,
                                                categories=df[col].values.categories)
    result
    

    So the answer to this felt more like a general programming problem than a typical Pandas question. Under the hood, a categorical series is just a bunch of integers that index into the list of category names. I did a groupby on these underlying integers because they don't have the same problem as categorical columns. After doing this I had to rename the columns. I then used the from_codes constructor to efficiently turn the integer codes back into categorical columns (a tiny round-trip sketch follows after the result below).

    Group1  Group2  Group3  Value
    A       B       C       54.34
    A       B       D       826.74
    B       A       A       765.40
    B       A       C       514.50
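
    Here is a tiny sketch of the codes/from_codes round trip described above (names made up for illustration):

    import pandas as pd

    c = pd.Categorical(["A", "B", "A"], categories=["A", "B", "C"])
    print(c.codes)          # [0 1 0]
    # the integer codes plus the original categories rebuild it exactly
    c2 = pd.Categorical.from_codes(c.codes, categories=c.categories)
    print((c2 == c).all())  # True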
    

    So I understand that this isn't exactly your answer, but I've made my solution into a little function for people who have this problem in the future.

    def categorical_groupby(df, group_cols, agg_function="sum"):
        """Do a groupby on a number of categorical columns."""
        result = df.groupby([df[col].values.codes for col in group_cols]).agg(agg_function)
        result = result.reset_index()
        level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
        result = result.rename(columns=level_to_column_name)
        for col in group_cols:
            result[col] = pd.Categorical.from_codes(result[col].values,
                                                    categories=df[col].values.categories)
        return result
    

    Call it like this:

    df.pipe(categorical_groupby, group_cols)
    