Is pandas.DataFrame.groupby Guaranteed To Be Stable?


Happy的楠姐 2021-01-04 18:21

I've noticed that there are several uses of pd.DataFrame.groupby followed by an apply implicitly assuming that groupby is stable - that is, if

1 answer

执念已碎 (OP)

2021-01-04 18:34

Although the docs don't state this, internally it uses a stable sort when generating the groups.

See:

https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L291

https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L4356

As I mentioned in the comments, this is important if you consider transform, which returns a Series with its index aligned to the original df. If the sort didn't preserve the within-group order, alignment would have to do extra work, because the Series would need to be re-sorted before being assigned. In fact, this is mentioned in the source comments:

_algos.groupsort_indexer implements counting sort and it is at least O(ngroups), where

ngroups = prod(shape)

shape = map(len, keys)

That is, linear in the number of combinations (cartesian product) of unique values of groupby keys. This can be huge when doing multi-key groupby. np.argsort(kind='mergesort') is O(count x log(count)) where count is the length of the data-frame; Both algorithms are stable sort and that is necessary for correctness of groupby operations.

e.g. consider: df.groupby(key)[col].transform('first')
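
To make the point above concrete, here is a small sketch. The frame, its column names (key, col) and its values are made up for illustration, and the result shows behaviour observed on this input rather than proving a guarantee. It checks that a stable argsort keeps rows with equal keys in their original relative order, and that transform('first') therefore picks the first row of each group as it appears in df.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups whose rows are interleaved.
df = pd.DataFrame(
    {
        "key": [1, 0, 1, 0, 1],
        "col": [10, 20, 30, 40, 50],
    }
)

# A stable sort on the key keeps equal keys in their original order:
# positions 1, 3 (key 0) come first, then 0, 2, 4 (key 1).
order = np.argsort(df["key"].to_numpy(), kind="mergesort")

# Rows inside each group come out in the same order as in df.
groups = {int(k): g["col"].tolist() for k, g in df.groupby("key")}

# 'first' means the first row of each group in original order,
# and the result is indexed like df, so it can be assigned directly.
first_per_row = df.groupby("key")["col"].transform("first")
df["first_in_group"] = first_per_row
```

If the sort were not stable, 'first' could return any row of the group, and the result would depend on the sorting algorithm rather than on the data.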
