Is pandas.DataFrame.groupby Guaranteed To Be Stable?

Happy的楠姐 2021-01-04 18:21

I've noticed that there are several uses of pd.DataFrame.groupby followed by an apply implicitly assuming that groupby is stable - that is, if

1 answer
  •  执念已碎
    2021-01-04 18:34

    Although the docs don't state this, internally it uses a stable sort when generating the groups.

    See:

    • https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L291
    • https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L4356
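    A quick way to observe this behaviour (a minimal sketch; the frame and the names df, group_rows and group_index are made up for illustration, and this shows observed behaviour rather than a documented guarantee):

```python
import pandas as pd

# Hypothetical data: keys are interleaved so any reordering would be visible.
df = pd.DataFrame({"key": ["b", "a", "b", "a", "b"],
                   "col": [10, 20, 30, 40, 50]})

# Collect the values and the original row labels of each group.
group_rows = {name: g["col"].tolist() for name, g in df.groupby("key")}
group_index = {name: g.index.tolist() for name, g in df.groupby("key")}

# Within each group the rows appear in their original relative order.
print(group_rows)   # {'a': [20, 40], 'b': [10, 30, 50]}
print(group_index)  # {'a': [1, 3], 'b': [0, 2, 4]}
```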

    As I mentioned in the comments, this is important if you consider transform, which returns a Series with its index aligned to the original df. If the sorting didn't preserve the order, alignment would have to do additional work, as it would need to sort the Series prior to assigning. In fact, this is mentioned in the source comments:

    _algos.groupsort_indexer implements counting sort and it is at least O(ngroups), where

    ngroups = prod(shape)

    shape = map(len, keys)

    That is, linear in the number of combinations (cartesian product) of unique values of groupby keys. This can be huge when doing multi-key groupby. np.argsort(kind='mergesort') is O(count x log(count)) where count is the length of the data-frame; Both algorithms are stable sort and that is necessary for correctness of groupby operations.

    e.g. consider: df.groupby(key)[col].transform('first')
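    To make the quoted comment concrete, here is a rough sketch of the two approaches it describes. counting_sort_indexer is a hypothetical pure-Python helper written for this illustration; it is not the pandas _algos.groupsort_indexer implementation, and it assumes the keys are already encoded as integer labels in the range 0..ngroups-1.

```python
import numpy as np
import pandas as pd


def counting_sort_indexer(labels, ngroups):
    """Illustrative stable counting sort (hypothetical, not pandas' own code)."""
    counts = np.zeros(ngroups, dtype=np.int64)
    for lab in labels:
        counts[lab] += 1
    # starts[g] is the output position where group g begins.
    starts = np.zeros(ngroups, dtype=np.int64)
    starts[1:] = np.cumsum(counts)[:-1]
    indexer = np.empty(len(labels), dtype=np.int64)
    # Scanning rows in original order is what keeps the sort stable.
    for i, lab in enumerate(labels):
        indexer[starts[lab]] = i
        starts[lab] += 1
    return indexer


df = pd.DataFrame({"key": [1, 0, 1, 2, 0, 1],
                   "col": [10, 20, 30, 40, 50, 60]})
labels = df["key"].to_numpy()

by_counting = counting_sort_indexer(labels, ngroups=3)
by_mergesort = np.argsort(labels, kind="mergesort")  # stable sort in NumPy

# Both stable sorts yield the same indexer, so the first row of each
# sorted group is also the first row of that group in the original frame.
first = df.groupby("key")["col"].transform("first")
```

    Here both indexers come out as [1, 4, 0, 2, 5, 3], and first holds [10, 20, 10, 40, 20, 10] on the original index. With an unstable sort, the row that ends up first in each group could differ from the row that comes first in df, so 'first' would no longer be well defined.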
