Question
This is a somewhat broad topic, but I will try to pare it down to some specific questions. I have noticed a difference between resample and groupby that I am curious to learn about. Here is some hourly time series data:
In[]:
import pandas as pd
dr = pd.date_range('01-01-2020 8:00', periods=10, freq='H')
df = pd.DataFrame({'A': range(10),
                   'B': range(10, 20),
                   'C': range(20, 30)}, index=dr)
df
Out[]:
A B C
2020-01-01 08:00:00 0 10 20
2020-01-01 09:00:00 1 11 21
2020-01-01 10:00:00 2 12 22
2020-01-01 11:00:00 3 13 23
2020-01-01 12:00:00 4 14 24
2020-01-01 13:00:00 5 15 25
2020-01-01 14:00:00 6 16 26
2020-01-01 15:00:00 7 17 27
2020-01-01 16:00:00 8 18 28
2020-01-01 17:00:00 9 19 29
I can downsample the data using either groupby with a pandas.Grouper that has a freq, or resample (which seems the more typical thing to do):
g = df.groupby(pd.Grouper(freq='2H'))
r = df.resample(rule='2H')
My impression was that these two were essentially the same thing (correct me if I am wrong, but isn't resample a rebranded groupby?). But I have found that when using the apply method of each grouped object, you can index specific columns in the "DataFrameGroupBy" object g but not in the "Resampler" object r:
def foo(d):
    return d['A'] - d['B'] + 2*d['C']
In[]:
g.apply(foo)
Out[]:
2020-01-01 08:00:00 2020-01-01 08:00:00 30
2020-01-01 09:00:00 32
2020-01-01 10:00:00 2020-01-01 10:00:00 34
2020-01-01 11:00:00 36
2020-01-01 12:00:00 2020-01-01 12:00:00 38
2020-01-01 13:00:00 40
2020-01-01 14:00:00 2020-01-01 14:00:00 42
2020-01-01 15:00:00 44
2020-01-01 16:00:00 2020-01-01 16:00:00 46
2020-01-01 17:00:00 48
dtype: int64
In[]:
r.apply(foo)
Out[]:
#long multi-Exception error stack ending in:
KeyError: 'A'
It looks like the data d that the apply "sees" is different in each case, as shown by:
def bar(d):
    print(d)
In[]:
g.apply(bar)
Out[]:
A B C
2020-01-01 08:00:00 0 10 20
2020-01-01 09:00:00 1 11 21
... #more DataFrames corresponding to each bin
In[]:
r.apply(bar)
Out[]:
2020-01-01 08:00:00 0
2020-01-01 09:00:00 1
Name: A, dtype: int64
2020-01-01 10:00:00 2
2020-01-01 11:00:00 3
Name: A, dtype: int64
... #more Series, first the bins for column "A", then "B", then "C"
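The printout above suggests that, at least in the pandas version used here, r.apply hands the function one column of one bin at a time. A function that only needs a single column fits that calling convention. The sketch below is illustrative only: the helper name spread is made up for this example, and lowercase 'h' is used for the frequency on the assumption that a newer pandas release (where 'H' is deprecated) may be in use.

```python
import pandas as pd

# Same data as in the question; lowercase "h" is an assumption made
# for compatibility with newer pandas releases.
dr = pd.date_range("2020-01-01 08:00", periods=10, freq="h")
df = pd.DataFrame({"A": range(10),
                   "B": range(10, 20),
                   "C": range(20, 30)}, index=dr)

def spread(s):
    # Hypothetical helper: reduces whatever it receives to max - min,
    # so it never needs to look up a column by name.
    return s.max() - s.min()

# One row per 2-hour bin, one column per original column.
out = df.resample("2h").apply(spread)
print(out)
```

Because spread never indexes by column label, it does not hit the KeyError that foo raised.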
However, if you simply iterate over the Resampler object, you get the bins as DataFrames, which seems similar to groupby:
In[]:
for i, d in r:
    print(d)
Out[]:
A B C
2020-01-01 08:00:00 0 10 20
2020-01-01 09:00:00 1 11 21
A B C
2020-01-01 10:00:00 2 12 22
2020-01-01 11:00:00 3 13 23
A B C
2020-01-01 12:00:00 4 14 24
2020-01-01 13:00:00 5 15 25
A B C
2020-01-01 14:00:00 6 16 26
2020-01-01 15:00:00 7 17 27
A B C
2020-01-01 16:00:00 8 18 28
2020-01-01 17:00:00 9 19 29
The printout is the same when iterating over the DataFrameGroupBy object.
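Since iteration yields whole DataFrames, one possible workaround is to call the multi-column function on each bin by hand and stitch the pieces back together. This is only a sketch under that assumption, not necessarily the recommended approach; the names per_bin and result are invented here, and lowercase 'h' is used because newer pandas releases deprecate 'H'.

```python
import pandas as pd

dr = pd.date_range("2020-01-01 08:00", periods=10, freq="h")
df = pd.DataFrame({"A": range(10),
                   "B": range(10, 20),
                   "C": range(20, 30)}, index=dr)

def foo(d):
    return d["A"] - d["B"] + 2 * d["C"]

r = df.resample("2h")

# Iterating gives (bin label, DataFrame) pairs, so every bin still
# carries all three columns when foo is called on it.
per_bin = {label: foo(frame) for label, frame in r}

# Concatenating a dict of Series puts the bin label in the outer
# index level, similar in shape to the g.apply(foo) output.
result = pd.concat(per_bin)
print(result)
```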
My questions based on the above:
- Can you access specific columns using resample and apply? I thought I had code where I did this, but now I think I am mistaken.
- Why does the resample apply work on Series for each column for each bin, instead of DataFrames for each bin?
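On the first question, a Resampler does support selecting columns with square brackets before a reduction is applied, much like a groupby object. That is different from indexing columns inside the applied function, but it may cover some cases. A minimal sketch, with the variable names sub and a_only made up for illustration and lowercase 'h' assumed for newer pandas:

```python
import pandas as pd

dr = pd.date_range("2020-01-01 08:00", periods=10, freq="h")
df = pd.DataFrame({"A": range(10),
                   "B": range(10, 20),
                   "C": range(20, 30)}, index=dr)

r = df.resample("2h")

# Select a subset of columns first, then reduce each 2-hour bin.
sub = r[["A", "C"]].sum()

# Selecting a single column gives a Series after the reduction.
a_only = r["A"].sum()

print(sub)
print(a_only)
```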
Any general comments about what is going on here, or whether this pattern should be encouraged or discouraged, would also be appreciated. Thanks!
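One pattern that sidesteps the issue, assuming the function is purely row-wise like foo, is to combine the columns on the full DataFrame first and only then bin the resulting Series. The sketch below is an illustration under that assumption, not a statement about which style pandas prefers; combined, per_bin and via_groupby are names invented here.

```python
import pandas as pd

dr = pd.date_range("2020-01-01 08:00", periods=10, freq="h")
df = pd.DataFrame({"A": range(10),
                   "B": range(10, 20),
                   "C": range(20, 30)}, index=dr)

def foo(d):
    return d["A"] - d["B"] + 2 * d["C"]

# Row-wise arithmetic does not depend on the bins, so it can run
# once on the whole frame before any resampling happens.
combined = foo(df)

# Binning a single Series works the same way with either API.
per_bin = combined.resample("2h").sum()
via_groupby = combined.groupby(pd.Grouper(freq="2h")).sum()

print(per_bin)
```

This only helps when the per-row computation and the per-bin reduction can be separated; a function that needs several columns of a whole bin at once would still need the groupby route or manual iteration.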
Source: https://stackoverflow.com/questions/62902115/what-is-the-difference-between-bins-when-using-groupby-apply-vs-resample-apply