Question
I have run into a behavior that I find peculiar when resampling Booleans in pandas. Here is some time series data:
import pandas as pd
import numpy as np
dr = pd.date_range('01-01-2020 5:00', periods=10, freq='H')
df = pd.DataFrame({'Bools':[True,True,False,False,False,True,True,np.nan,np.nan,False],
"Nums":range(10)},
index=dr)
So the data look like:
Bools Nums
2020-01-01 05:00:00 True 0
2020-01-01 06:00:00 True 1
2020-01-01 07:00:00 False 2
2020-01-01 08:00:00 False 3
2020-01-01 09:00:00 False 4
2020-01-01 10:00:00 True 5
2020-01-01 11:00:00 True 6
2020-01-01 12:00:00 NaN 7
2020-01-01 13:00:00 NaN 8
2020-01-01 14:00:00 False 9
I would have thought I could do simple operations (like a sum) on the boolean column when resampling, but (as is) this fails:
>>> df.resample('5H').sum()
Nums
2020-01-01 05:00:00 10
2020-01-01 10:00:00 35
The "Bools" column is dropped. My impression was that this happens because the dtype of the column is object. Changing the dtype remedies the issue:
>>> r = df.resample('5H')
>>> copy = df.copy() #just doing this to preserve df for the example
>>> copy['Bools'] = copy['Bools'].astype(float)
>>> copy.resample('5H').sum()
Bools Nums
2020-01-01 05:00:00 2.0 10
2020-01-01 10:00:00 2.0 35
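To make the dtype observation concrete, here is a minimal sketch that rebuilds the frame from the question, inspects the dtype of "Bools", and repeats the float cast. The frequencies are passed as pd.Timedelta objects instead of the 'H' / '5H' strings; that substitution is an assumption made here so the sketch does not depend on how a given pandas version spells the hourly alias.

```python
import numpy as np
import pandas as pd

# Same data as in the question; pd.Timedelta replaces the 'H' / '5H' aliases
# (an assumption of this sketch, not part of the original post).
dr = pd.date_range("2020-01-01 05:00", periods=10, freq=pd.Timedelta(hours=1))
df = pd.DataFrame(
    {
        "Bools": [True, True, False, False, False, True, True, np.nan, np.nan, False],
        "Nums": range(10),
    },
    index=dr,
)

# Mixing True/False with NaN leaves the column stored as generic Python objects.
bools_dtype = df["Bools"].dtype

# Casting to float maps True -> 1.0 and False -> 0.0, and keeps NaN as NaN.
as_float = df.assign(Bools=df["Bools"].astype(float))

# With a float column, the sum skips NaN and both columns are aggregated.
summed = as_float.resample(pd.Timedelta(hours=5)).sum()
print(bools_dtype)
print(summed)
```

The cast keeps the missing values as NaN rather than deciding what they mean, which is why the result is a float column.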
But (oddly) you can still sum the Booleans by indexing the resample object, without changing the dtype:
>>> r = df.resample('5H')
>>> r['Bools'].sum()
2020-01-01 05:00:00 2
2020-01-01 10:00:00 2
Freq: 5H, Name: Bools, dtype: int64
Also, if the Booleans are the only column, the resample still works (even though the column is still object):
>>> df.drop(['Nums'],axis=1).resample('5H').sum()
Bools
2020-01-01 05:00:00 2
2020-01-01 10:00:00 2
What allows the latter two examples to work? I can see that they are maybe a little more explicit ("Please, I really want to resample this column!"), but I don't see why the original resample doesn't allow the operation if it can be done.
Answer 1:
Tracing through the pandas source shows that:
df.resample('5H')['Bools'].sum == Groupby.sum (in pd.core.groupby.generic.SeriesGroupBy)
df.resample('5H').sum == sum (in pandas.core.resample.DatetimeIndexResampler)
Following groupby_function in groupby.py shows that the first is equivalent to
r.agg(lambda x: np.sum(x, axis=r.axis))
where r = df.resample('5H'),
which outputs:
Bools Nums Nums2
2020-01-01 05:00:00 2 10 10
2020-01-01 10:00:00 2 35 35
Strictly speaking, for the case above it should have been r = df.resample('5H')['Bools'].
Following the _downsample function in resample.py shows that df.resample('5H').sum is equivalent to:
df.groupby(r.grouper, axis=r.axis).agg(np.sum)
which outputs:
Nums Nums2
2020-01-01 05:00:00 10 10
2020-01-01 10:00:00 35 35
Answer 2:
df.resample('5H').sum() doesn't work on the Bools column because the column holds mixed data types (Booleans and NaN), which pandas stores as object. When sum() is called on a resample or groupby object, object-typed columns are ignored.
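One way to sidestep the object dtype entirely is to give the column a real bool dtype before resampling. The sketch below does that with Series.eq(True), which assumes that a missing value should count as False; that choice belongs to this example and is not something the answers above prescribe. As before, pd.Timedelta stands in for the 'H' / '5H' aliases.

```python
import numpy as np
import pandas as pd

# Same data as in the question; pd.Timedelta replaces the 'H' / '5H' aliases
# (an assumption of this sketch).
dr = pd.date_range("2020-01-01 05:00", periods=10, freq=pd.Timedelta(hours=1))
df = pd.DataFrame(
    {
        "Bools": [True, True, False, False, False, True, True, np.nan, np.nan, False],
        "Nums": range(10),
    },
    index=dr,
)

# Only "Nums" is seen as numeric; "Bools" is object because of the NaN entries.
numeric_cols = list(df.select_dtypes(include="number").columns)

# Assumption: NaN is treated as False. eq(True) compares element-wise and
# returns a proper bool column, so no object dtype is left in the frame.
clean = df.assign(Bools=df["Bools"].eq(True))

# Bool columns are summed as 0/1, so this counts the True values per 5-hour bin.
counts = clean.resample(pd.Timedelta(hours=5)).sum()
print(numeric_cols)
print(counts)
```

If the missing values need to stay distinguishable from False, the float cast shown in the question keeps them as NaN instead.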
Source: https://stackoverflow.com/questions/62903838/resampling-boolean-values-in-pandas