How to do cumsum based on a time condition - resample pandas?

会有一股神秘感。 submitted on 2019-12-23 23:20:00

Question


I have a dataframe like the one shown below:

import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1],
    'time_1': ['2173-04-03 10:00:00', '2173-04-03 10:15:00', '2173-04-03 10:30:00',
               '2173-04-03 10:45:00', '2173-04-03 11:05:00', '2173-04-03 11:15:00'],
    'val': [5, 6, 5, 6, 6, 6]
})

I would like to find the total duration for which a value appears consecutively. The example below should help explain.

From the above screenshot, you can see that 6 occurs in sequence from 10:45 to 23:59, whereas the other values (which could be anything in real data) do not occur in sequence at all.

I tried something like this, but it doesn't give the expected output. It cumulatively sums all values:

df['time_1'] = pd.to_datetime(df['time_1'])
df['seq'] = df['val'] == df['val'].shift(-1)

s=pd.to_timedelta(24,unit='h')-(df.time_1-df.time_1.dt.normalize())
df['tdiff'] =df.groupby(df.time_1.dt.date).time_1.diff().shift(-1).fillna(s).dt.total_seconds()/3600
df.groupby([df['seq'] == True])['tdiff'].cumsum() # do cumulative sum only when the values are in sequence

How can I do a cumulative sum within a group based on a condition?

I expect my output to look like the one shown below. You see 13:15 because no other value appears in the data during the 13 hours 15 minutes that follow 10:45, the first occurrence of the final run of 6 (24:00 - 10:45 gives 13:15).
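The 13:15 figure can be checked with a couple of lines. This is only a small sketch of the arithmetic described above; the variable names are illustrative and not part of the original question.

```python
import pandas as pd

# Start of the final run of 6 in the sample data
start = pd.Timestamp("2173-04-03 10:45:00")

# Time elapsed since midnight of the same day
since_midnight = start - start.normalize()

# Time remaining until the end of that day (24:00)
until_midnight = pd.Timedelta(hours=24) - since_midnight
print(until_midnight)  # 0 days 13:15:00
```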

Test dataframe

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00', '2173-04-03 12:59:00',
               '2173-04-03 13:14:00', '2173-04-03 13:37:00', '2173-04-04 11:30:00',
               '2173-04-05 16:00:00', '2173-04-05 22:00:00', '2173-04-06 04:00:00',
               '2173-04-06 04:30:00', '2173-04-06 08:00:00'],
    'val': [5, 5, 5, 5, 10, 5, 5, 8, 3, 4, 6]
})

Answer 1:


IIUC, try:

m=df.groupby(df.val.ne(df.val.shift()).cumsum()).first().rename_axis(None)
c=pd.to_timedelta(24,unit='h')-(m.time_1-m.time_1.dt.normalize())
final=m.assign(cumsum=m.time_1.diff().shift(-1).fillna(c))

   subject_id              time_1  val   cumsum
1           1 2173-04-03 10:00:00    5 00:15:00
2           1 2173-04-03 10:15:00    6 00:15:00
3           1 2173-04-03 10:30:00    5 00:15:00
4           1 2173-04-03 10:45:00    6 13:15:00

Details:

df.val.ne(df.val.shift()).cumsum() checks, row by row, whether the value differs from the previous row, and so assigns each run of consecutive equal values to its own group.

We group by this key and take the first entry of each group. Then we take diff() of time_1 and shift it up by one row so that each duration lines up with the start of its run. The last row has no successor, so fillna fills it with the time remaining until 24:00.
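To see what the grouping key looks like, here is a small sketch using the val column from the first sample dataframe. The names changed and run_id are only illustrative.

```python
import pandas as pd

val = pd.Series([5, 6, 5, 6, 6, 6])

# True wherever the value differs from the previous row (the first row is always True)
changed = val.ne(val.shift())

# A running count of the changes gives each consecutive run its own id
run_id = changed.cumsum()
print(run_id.tolist())  # [1, 2, 3, 4, 4, 4]
```

The last three rows share id 4, which is why they collapse into a single row starting at 10:45 in the output above.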




Answer 2:


1) First, convert the time column to datetime:

df.time_1 = pd.to_datetime(df.time_1)

2) Then you can label runs of consecutive repeated values:

df['val_groups'] = (df.val != df.val.shift()).cumsum()

3) You also need, for each row, the time until the next value:

df['time_till_next_val'] = df.time_1.diff().shift(-1)

4) Next, group by the consecutive-value groups and compute your consum column:

cols = ['subject_id', 'time_1', 'val', 'consum']
# sort=False keeps the groups in order of appearance, so the last row is the last run
df_consum = (df.groupby(['subject_id', 'val', 'val_groups'], sort=False)
               .agg(consum=('time_till_next_val', 'sum'), time_1=('time_1', 'first'))
               .reset_index()[cols])

5) Finally, compute the consum value for the last group:

last_start_time_group = df.time_1.iloc[df.val_groups.eq(df.val_groups.max()).idxmax()]
# time of day as a Timedelta; recent pandas rejects unit= together with a string input
last_start_time_group = pd.to_timedelta(last_start_time_group.strftime('%H:%M:%S'))
last_group_consum = pd.Timedelta(hours=24) - last_start_time_group

# assign through .loc to avoid chained assignment
df_consum.loc[df_consum.index[-1], 'consum'] = last_group_consum
df_consum
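For reference, the steps from both answers can be combined into one self-contained sketch. The helper name run_durations and the duration column are hypothetical, not from either answer, and the sketch assumes a single subject_id sorted by time, with only the final run measured up to midnight.

```python
import pandas as pd

def run_durations(df):
    """Hypothetical helper: one row per consecutive run of `val`, with its duration."""
    df = df.copy()
    df["time_1"] = pd.to_datetime(df["time_1"])

    # Id that increases every time `val` changes from the previous row
    run_id = df["val"].ne(df["val"].shift()).cumsum().rename("run_id")

    # First row of each run, kept in order of appearance
    runs = df.groupby(run_id, sort=False).first().reset_index(drop=True)

    # Duration of a run = start of the next run minus its own start
    duration = runs["time_1"].diff().shift(-1)

    # The final run has no successor: measure up to 24:00 of its own day
    to_midnight = pd.Timedelta(hours=24) - (runs["time_1"] - runs["time_1"].dt.normalize())
    runs["duration"] = duration.fillna(to_midnight)
    return runs

sample = pd.DataFrame({
    "subject_id": [1, 1, 1, 1, 1, 1],
    "time_1": ["2173-04-03 10:00:00", "2173-04-03 10:15:00", "2173-04-03 10:30:00",
               "2173-04-03 10:45:00", "2173-04-03 11:05:00", "2173-04-03 11:15:00"],
    "val": [5, 6, 5, 6, 6, 6],
})
out = run_durations(sample)
print(out[["time_1", "val", "duration"]])
```

With several subjects, the same logic would presumably need to run per subject_id, for example by adding subject_id to the grouping; that is left out here to keep the sketch short.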




Source: https://stackoverflow.com/questions/57735722/how-to-do-cumsum-based-on-a-time-condition-resample-pandas
