Aggregating hourly time series by day via pd.TimeGrouper('D'); issue at timestamp 00:00:00 (hour 24)

Posted by 不问归期 on 2020-01-15 10:05:42

Question


df:

                    hour    rev
datetime        
2016-05-01 01:00:00 1   -0.02
2016-05-01 02:00:00 2   -0.01
2016-05-01 03:00:00 3   -0.02
2016-05-01 04:00:00 4   -0.02
2016-05-01 05:00:00 5   -0.01
2016-05-01 06:00:00 6   -0.03
2016-05-01 07:00:00 7   -0.10
2016-05-01 08:00:00 8   -0.09
2016-05-01 09:00:00 9   -0.08
2016-05-01 10:00:00 10  -0.10
2016-05-01 11:00:00 11  -0.12
2016-05-01 12:00:00 12  -0.14
2016-05-01 13:00:00 13  -0.17
2016-05-01 14:00:00 14  -0.16
2016-05-01 15:00:00 15  -0.15
2016-05-01 16:00:00 16  -0.15
2016-05-01 17:00:00 17  -0.17
2016-05-01 18:00:00 18  -0.16
2016-05-01 19:00:00 19  -0.18
2016-05-01 20:00:00 20  -0.17
2016-05-01 21:00:00 21  -0.14
2016-05-01 22:00:00 22  -0.16
2016-05-01 23:00:00 23  -0.08
2016-05-02 00:00:00 24  -0.06

df.reset_index().to_dict('rec'):

[{'datetime': Timestamp('2016-05-01 01:00:00'), 'hour': 1L, 'rev': -0.02},
 {'datetime': Timestamp('2016-05-01 02:00:00'), 'hour': 2L, 'rev': -0.01},
 {'datetime': Timestamp('2016-05-01 03:00:00'), 'hour': 3L, 'rev': -0.02},
 {'datetime': Timestamp('2016-05-01 04:00:00'), 'hour': 4L, 'rev': -0.02},
 {'datetime': Timestamp('2016-05-01 05:00:00'), 'hour': 5L, 'rev': -0.01},
 {'datetime': Timestamp('2016-05-01 06:00:00'), 'hour': 6L, 'rev': -0.03},
 {'datetime': Timestamp('2016-05-01 07:00:00'), 'hour': 7L, 'rev': -0.1},
 {'datetime': Timestamp('2016-05-01 08:00:00'), 'hour': 8L, 'rev': -0.09},
 {'datetime': Timestamp('2016-05-01 09:00:00'), 'hour': 9L, 'rev': -0.08},
 {'datetime': Timestamp('2016-05-01 10:00:00'), 'hour': 10L, 'rev': -0.1},
 {'datetime': Timestamp('2016-05-01 11:00:00'), 'hour': 11L, 'rev': -0.12},
 {'datetime': Timestamp('2016-05-01 12:00:00'), 'hour': 12L, 'rev': -0.14},
 {'datetime': Timestamp('2016-05-01 13:00:00'), 'hour': 13L, 'rev': -0.17},
 {'datetime': Timestamp('2016-05-01 14:00:00'), 'hour': 14L, 'rev': -0.16},
 {'datetime': Timestamp('2016-05-01 15:00:00'), 'hour': 15L, 'rev': -0.15},
 {'datetime': Timestamp('2016-05-01 16:00:00'), 'hour': 16L, 'rev': -0.15},
 {'datetime': Timestamp('2016-05-01 17:00:00'), 'hour': 17L, 'rev': -0.17},
 {'datetime': Timestamp('2016-05-01 18:00:00'), 'hour': 18L, 'rev': -0.16},
 {'datetime': Timestamp('2016-05-01 19:00:00'), 'hour': 19L, 'rev': -0.18},
 {'datetime': Timestamp('2016-05-01 20:00:00'), 'hour': 20L, 'rev': -0.17},
 {'datetime': Timestamp('2016-05-01 21:00:00'), 'hour': 21L, 'rev': -0.14},
 {'datetime': Timestamp('2016-05-01 22:00:00'), 'hour': 22L, 'rev': -0.16},
 {'datetime': Timestamp('2016-05-01 23:00:00'), 'hour': 23L, 'rev': -0.08},
 {'datetime': Timestamp('2016-05-02 00:00:00'), 'hour': 24L, 'rev': -0.06}]

df.set_index('datetime', inplace=True)

I want to aggregate the data by day, so I do:

dfgrped = df.groupby([pd.TimeGrouper('D')])

I want to compute stats like the sum:

dfgrped.agg(sum)

            hour    rev
datetime        
2016-05-01  276 -2.43
2016-05-02  24  -0.06

As you can see, the aggregation produces one row for 2016-05-01 and another for 2016-05-02.

Note that the last hourly entry in df is stamped 2016-05-02 00:00:00, but it is meant to be the data for the last hour of the previous day, so that each day has 24 hourly data points.

However, given that timestamp, things don't work out the way I intended: I want all 24 hours to be aggregated under 2016-05-01.

I imagine this sort of issue arises often when a measurement is taken at the end of each hour. It isn't a problem until the last hour, whose timestamp is 00:00:00 of the following day.

How can I address this in pandas?
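
For readers on a current pandas release: pd.TimeGrouper was deprecated and later removed, and pd.Grouper(freq='D') is its usual replacement. The sketch below rebuilds the sample frame and reproduces the split across two days. The pd.date_range construction is an assumption made for brevity; the values are the ones listed in the question.

```python
import pandas as pd

# Values copied from the question; the index is rebuilt with date_range
# (an assumption for brevity, equivalent to the 24 timestamps shown).
rev = [-0.02, -0.01, -0.02, -0.02, -0.01, -0.03, -0.10, -0.09,
       -0.08, -0.10, -0.12, -0.14, -0.17, -0.16, -0.15, -0.15,
       -0.17, -0.16, -0.18, -0.17, -0.14, -0.16, -0.08, -0.06]
idx = pd.date_range("2016-05-01 01:00", periods=24, freq="h", name="datetime")
df = pd.DataFrame({"hour": range(1, 25), "rev": rev}, index=idx)

# pd.Grouper(freq="D") stands in for the removed pd.TimeGrouper('D').
daily = df.groupby(pd.Grouper(freq="D")).sum()

# Hours 1..23 land on 2016-05-01 (hour sum 276);
# the 00:00:00 row lands alone on 2016-05-02 (hour sum 24).
print(daily)
```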


Answer 1:


It looks like another hack, but it should do the job (assuming datetime is a regular column, e.g. after df.reset_index()):

In [79]: df.assign(t=df.datetime - pd.Timedelta(hours=1)).drop(columns='datetime').groupby(pd.Grouper(key='t', freq='D')).sum()
Out[79]:
            hour   rev
t
2016-05-01   300 -2.49
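
If datetime is still the index, as in the question after set_index, the same one-hour shift can be applied to the index directly. This is a minimal sketch of that variant, not the answerer's exact code; the frame is rebuilt with pd.date_range as an assumption, using the question's values.

```python
import pandas as pd

# Sample data from the question, indexed by end-of-hour timestamps.
rev = [-0.02, -0.01, -0.02, -0.02, -0.01, -0.03, -0.10, -0.09,
       -0.08, -0.10, -0.12, -0.14, -0.17, -0.16, -0.15, -0.15,
       -0.17, -0.16, -0.18, -0.17, -0.14, -0.16, -0.08, -0.06]
idx = pd.date_range("2016-05-01 01:00", periods=24, freq="h", name="datetime")
df = pd.DataFrame({"hour": range(1, 25), "rev": rev}, index=idx)

# Relabel each row with the start of its hour instead of the end,
# so 2016-05-02 00:00:00 becomes 2016-05-01 23:00:00.
shifted = df.copy()
shifted.index = shifted.index - pd.Timedelta(hours=1)

# Group the relabelled rows by calendar day.
daily = shifted.groupby(pd.Grouper(freq="D")).sum()
print(daily)
```

Working on a copy keeps the original end-of-hour labels available for anything else that depends on them.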



Answer 2:


A slightly hacky solution: as long as the first reading of each day falls more than one second after midnight, you can subtract one second from the datetime column and then group by date, which works for your case:

from datetime import timedelta
import pandas as pd
# select the numeric columns so the datetime column itself is not summed
df.groupby((df.datetime - timedelta(seconds=1)).dt.date)[['hour', 'rev']].sum()

#             hour    rev
#   datetime        
# 2016-05-01   300  -2.49
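
A self-contained sketch of this one-second approach, with datetime as a regular column. The frame is rebuilt with pd.date_range as an assumption, using the question's values. The numeric columns are selected explicitly because recent pandas versions raise an error when asked to sum a datetime column rather than silently dropping it.

```python
from datetime import date, timedelta

import pandas as pd

# Sample data from the question, with datetime as an ordinary column.
rev = [-0.02, -0.01, -0.02, -0.02, -0.01, -0.03, -0.10, -0.09,
       -0.08, -0.10, -0.12, -0.14, -0.17, -0.16, -0.15, -0.15,
       -0.17, -0.16, -0.18, -0.17, -0.14, -0.16, -0.08, -0.06]
df = pd.DataFrame({
    "datetime": pd.date_range("2016-05-01 01:00", periods=24, freq="h"),
    "hour": range(1, 25),
    "rev": rev,
})

# Nudge every timestamp back by one second so that midnight falls on
# the previous calendar day, then take the date part as the group key.
day_key = (df["datetime"] - timedelta(seconds=1)).dt.date

# Sum only the numeric columns.
daily = df.groupby(day_key)[["hour", "rev"]].sum()
print(daily)
```

Note that the resulting index holds plain datetime.date objects rather than a DatetimeIndex, which matters if the result is resampled or joined later.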



Answer 3:


Simply shift the rev column back by one row with .shift(-1) (or np.roll(..., -1)), so that each timestamp marks the start of its period rather than the end. You would need to add one timestamp at the beginning to hold the first value.
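
One way to read this suggestion is sketched below for the rev column only. The extra timestamp is taken to be 2016-05-01 00:00:00, prepended to the index; that interpretation, the variable names, and the pd.date_range construction are assumptions.

```python
import pandas as pd

# rev values from the question, labelled with end-of-hour timestamps.
rev = [-0.02, -0.01, -0.02, -0.02, -0.01, -0.03, -0.10, -0.09,
       -0.08, -0.10, -0.12, -0.14, -0.17, -0.16, -0.15, -0.15,
       -0.17, -0.16, -0.18, -0.17, -0.14, -0.16, -0.08, -0.06]
idx = pd.date_range("2016-05-01 01:00", periods=24, freq="h", name="datetime")
s = pd.Series(rev, index=idx, name="rev")

# Add the missing period-start timestamp (2016-05-01 00:00:00) in front;
# reindexing leaves its value as NaN for now.
start = idx[0] - pd.Timedelta(hours=1)
extended = s.reindex(idx.insert(0, start))

# Move every value up one row so it sits on the start of its hour,
# then drop the final row (2016-05-02 00:00:00), which is now empty.
relabelled = extended.shift(-1).iloc[:-1]

# All 24 values now fall on 2016-05-01.
daily = relabelled.groupby(pd.Grouper(freq="D")).sum()
print(daily)
```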



Source: https://stackoverflow.com/questions/39065034/aggregating-hourly-time-series-by-day-via-pd-timegrouperd-issue-timestamp
