What is the difference between bins when using groupby apply vs resample apply?

人走茶凉 posted on 2021-02-11 15:34:32

Question


This is a fairly broad topic, but I will try to pare it down to some specific questions.

I have noticed a difference between resample and groupby that I am curious to learn about. Here is some hourly time series data:

In[]:
import pandas as pd

dr = pd.date_range('01-01-2020 8:00', periods=10, freq='H')
df = pd.DataFrame({'A':range(10),
                   'B':range(10,20),
                   'C':range(20,30)}, index=dr)
df

Out[]:
                     A   B   C
2020-01-01 08:00:00  0  10  20
2020-01-01 09:00:00  1  11  21
2020-01-01 10:00:00  2  12  22
2020-01-01 11:00:00  3  13  23
2020-01-01 12:00:00  4  14  24
2020-01-01 13:00:00  5  15  25
2020-01-01 14:00:00  6  16  26
2020-01-01 15:00:00  7  17  27
2020-01-01 16:00:00  8  18  28
2020-01-01 17:00:00  9  19  29

I can downsample the data using either groupby with a pandas.Grouper that sets freq, or resample (which seems to be the more typical choice):

g = df.groupby(pd.Grouper(freq='2H'))
r = df.resample(rule='2H')
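
For built-in reductions the two objects can be checked against each other directly. The following is a minimal sketch, not part of the original session: it rebuilds the frame above and compares the 2-hour sums from both routes. The frequencies are spelled in minutes ('60min', '120min') only because the 'H' alias is deprecated in recent pandas releases; that spelling and the names `by_group` and `by_resample` are assumptions of this sketch.

```python
import pandas as pd

# Same hourly data as above, with the frequency written in minutes.
dr = pd.date_range('2020-01-01 08:00', periods=10, freq='60min')
df = pd.DataFrame({'A': range(10),
                   'B': range(10, 20),
                   'C': range(20, 30)}, index=dr)

# Downsample to 2-hour bins through both routes and reduce with sum().
by_group = df.groupby(pd.Grouper(freq='120min')).sum()
by_resample = df.resample('120min').sum()

# Compare bin labels and values separately.
same_index = by_group.index.equals(by_resample.index)
same_values = bool((by_group.to_numpy() == by_resample.to_numpy()).all())
print(same_index, same_values)
```

If both flags come back true, any difference between the two objects is in how apply dispatches a custom function, not in how the bins are formed.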

My impression was that these two were essentially the same thing (correct me if I am wrong, but isn't resample a rebranded groupby?). However, I have found that when using the apply method of each grouped object, you can index specific columns with the "DataFrameGroupBy" object g but not with the "Resampler" object r:

def foo(d):
    return d['A'] - d['B'] + 2 * d['C']

In[]:
g.apply(foo)

Out[]:
2020-01-01 08:00:00  2020-01-01 08:00:00    30
                     2020-01-01 09:00:00    32
2020-01-01 10:00:00  2020-01-01 10:00:00    34
                     2020-01-01 11:00:00    36
2020-01-01 12:00:00  2020-01-01 12:00:00    38
                     2020-01-01 13:00:00    40
2020-01-01 14:00:00  2020-01-01 14:00:00    42
                     2020-01-01 15:00:00    44
2020-01-01 16:00:00  2020-01-01 16:00:00    46
                     2020-01-01 17:00:00    48
dtype: int64

In[]:
r.apply(foo)

Out[]:
#long multi-Exception error stack ending in:
KeyError: 'A'

It looks like the data d that the apply "sees" is different in each case, as shown by:

def bar(d):
    print(d)

In[]:
g.apply(bar)

Out[]:
                     A   B   C
2020-01-01 08:00:00  0  10  20
2020-01-01 09:00:00  1  11  21
... #more DataFrames corresponding to each bin

In[]:
r.apply(bar)

Out[]:
2020-01-01 08:00:00    0
2020-01-01 09:00:00    1
Name: A, dtype: int64
2020-01-01 10:00:00    2
2020-01-01 11:00:00    3
Name: A, dtype: int64
... #more Series, first the bins for column "A", then "B", then "C" 
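
The same comparison can be made without reading printouts by logging the type of each chunk. This is a minimal sketch under stated assumptions: `make_probe` is a hypothetical helper, the frequencies are spelled in minutes, and the probe returns a scalar so that both apply calls accept it. What the Resampler hands to the function may depend on the installed pandas version, so no particular output is claimed for it here.

```python
import pandas as pd

dr = pd.date_range('2020-01-01 08:00', periods=10, freq='60min')
df = pd.DataFrame({'A': range(10),
                   'B': range(10, 20),
                   'C': range(20, 30)}, index=dr)

def make_probe(log):
    """Return a reducer that records the type name of every chunk it receives."""
    def probe(chunk):
        log.append(type(chunk).__name__)
        return len(chunk)  # reduce each chunk to a scalar
    return probe

seen_by_groupby, seen_by_resample = [], []
df.groupby(pd.Grouper(freq='120min')).apply(make_probe(seen_by_groupby))
df.resample('120min').apply(make_probe(seen_by_resample))

print(sorted(set(seen_by_groupby)))
print(sorted(set(seen_by_resample)))  # may differ between pandas versions
```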

However, if you simply iterate over the Resampler object, you get the bins as DataFrames, which seems similar to groupby:

In[]:
for i, d in r:
    print(d)

Out[]:
                     A   B   C
2020-01-01 08:00:00  0  10  20
2020-01-01 09:00:00  1  11  21
                     A   B   C
2020-01-01 10:00:00  2  12  22
2020-01-01 11:00:00  3  13  23
                     A   B   C
2020-01-01 12:00:00  4  14  24
2020-01-01 13:00:00  5  15  25
                     A   B   C
2020-01-01 14:00:00  6  16  26
2020-01-01 15:00:00  7  17  27
                     A   B   C
2020-01-01 16:00:00  8  18  28
2020-01-01 17:00:00  9  19  29

The printout is the same when iterating over the DataFrameGroupBy object.

My questions, based on the above:

  • Can you access specific columns using resample and apply? I thought I had code where I did this but now I think I am mistaken.
  • Why does the resample apply work on Series for each column for each bin, instead of DataFrames for each bin?

Any general comments about what is going on here, or whether this pattern should be encouraged or discouraged, would also be appreciated. Thanks!
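
For reference, here are three ways to combine columns per bin that do not depend on what Resampler.apply passes to the function. This is a minimal sketch, assuming a per-bin total of A - B + 2*C is the goal; the reducer `combined_total` and the result names are hypothetical, and the frequencies are spelled in minutes.

```python
import pandas as pd

dr = pd.date_range('2020-01-01 08:00', periods=10, freq='60min')
df = pd.DataFrame({'A': range(10),
                   'B': range(10, 20),
                   'C': range(20, 30)}, index=dr)

def combined_total(chunk):
    """Hypothetical reducer: total of A - B + 2*C over one bin."""
    return (chunk['A'] - chunk['B'] + 2 * chunk['C']).sum()

# Option 1: iterate over the Resampler, which yields (label, DataFrame) pairs.
via_iteration = pd.Series({label: combined_total(chunk)
                           for label, chunk in df.resample('120min')})

# Option 2: use the Grouper-based groupby, whose apply receives DataFrames.
via_groupby = df.groupby(pd.Grouper(freq='120min')).apply(combined_total)

# Option 3: combine the columns first, then resample the resulting Series.
via_precompute = (df['A'] - df['B'] + 2 * df['C']).resample('120min').sum()

print(via_iteration.tolist())
print(via_groupby.tolist())
print(via_precompute.tolist())
```

Option 3 only applies when the column arithmetic is row-wise, as it is in foo; a function that needs the whole bin at once would have to use one of the first two.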

Source: https://stackoverflow.com/questions/62902115/what-is-the-difference-between-bins-when-using-groupby-apply-vs-resample-apply
