Problem
I'm trying to keep a running count of consecutive timestamps (minute frequency). I currently take a cumulative sum and reset it whenever two columns do not match, but it's done with a for loop. Is there a way to do this without the loop?
Code
cb_arbitrage['shift'] = cb_arbitrage.index.shift(1, freq='min')  # 'min' replaces the deprecated 'T' alias
Returns:
cccccccc bbbbbbbb cb_spread shift
timestamp
2017-07-07 18:23:00 2535.002000 2524.678462 10.323538 2017-07-07 18:24:00
2017-07-07 18:24:00 2535.007826 2523.297619 11.710207 2017-07-07 18:25:00
2017-07-07 18:25:00 2535.004167 2524.391000 10.613167 2017-07-07 18:26:00
2017-07-07 18:26:00 2534.300000 2521.838667 12.461333 2017-07-07 18:27:00
2017-07-07 18:27:00 2530.231429 2520.195625 10.035804 2017-07-07 18:28:00
2017-07-07 18:28:00 2529.444667 2518.782143 10.662524 2017-07-07 18:29:00
2017-07-07 18:29:00 2528.988000 2518.802963 10.185037 2017-07-07 18:30:00
2017-07-07 18:59:00 2514.403367 2526.473333 12.069966 2017-07-07 19:00:00
2017-07-07 19:01:00 2516.410000 2528.980000 12.570000 2017-07-07 19:02:00
Then I do the following:
cb_arbitrage['shift'] = cb_arbitrage['shift'].shift(1)
cb_arbitrage.loc[cb_arbitrage.index[0], 'shift'] = cb_arbitrage.index[0]  # avoids chained assignment
cb_arbitrage['count'] = 0
Which returns:
cccccccc bbbbbbbb cb_spread shift count
timestamp
2017-07-07 18:23:00 2535.002000 2524.678462 10.323538 2017-07-07 18:23:00 0
2017-07-07 18:24:00 2535.007826 2523.297619 11.710207 2017-07-07 18:24:00 0
2017-07-07 18:25:00 2535.004167 2524.391000 10.613167 2017-07-07 18:25:00 0
2017-07-07 18:26:00 2534.300000 2521.838667 12.461333 2017-07-07 18:26:00 0
2017-07-07 18:27:00 2530.231429 2520.195625 10.035804 2017-07-07 18:27:00 0
2017-07-07 18:28:00 2529.444667 2518.782143 10.662524 2017-07-07 18:28:00 0
2017-07-07 18:29:00 2528.988000 2518.802963 10.185037 2017-07-07 18:29:00 0
2017-07-07 18:59:00 2514.403367 2526.473333 12.069966 2017-07-07 18:30:00 0
2017-07-07 19:01:00 2516.410000 2528.980000 12.570000 2017-07-07 19:00:00 0
Then, the loop to calculate the cumulative sum, with reset:
count = 0
for i, row in cb_arbitrage.iterrows():
    if i == row['shift']:
        # this row follows the previous one by exactly one minute
        count += 1
    else:
        # gap found: restart the streak
        count = 1
    # .at replaces set_value, which was removed in pandas 1.0
    cb_arbitrage.at[i, 'count'] = count
Which gives me my expected result:
cccccccc bbbbbbbb cb_spread shift count
timestamp
2017-07-07 18:23:00 2535.002000 2524.678462 10.323538 2017-07-07 18:23:00 1
2017-07-07 18:24:00 2535.007826 2523.297619 11.710207 2017-07-07 18:24:00 2
2017-07-07 18:25:00 2535.004167 2524.391000 10.613167 2017-07-07 18:25:00 3
2017-07-07 18:26:00 2534.300000 2521.838667 12.461333 2017-07-07 18:26:00 4
2017-07-07 18:27:00 2530.231429 2520.195625 10.035804 2017-07-07 18:27:00 5
2017-07-07 18:28:00 2529.444667 2518.782143 10.662524 2017-07-07 18:28:00 6
2017-07-07 18:29:00 2528.988000 2518.802963 10.185037 2017-07-07 18:29:00 7
2017-07-07 18:59:00 2514.403367 2526.473333 12.069966 2017-07-07 18:30:00 1
2017-07-07 19:01:00 2516.410000 2528.980000 12.570000 2017-07-07 19:00:00 1
2017-07-07 21:55:00 2499.904560 2510.814000 10.909440 2017-07-07 19:02:00 1
2017-07-07 21:56:00 2500.134615 2510.812857 10.678242 2017-07-07 21:56:00 2
Answer 1:
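You can use the diff method, which computes the difference between the current row and the previous row, and then check whether that difference equals one minute. From there, it takes some trickery to reset the streaks. Note that the code below accesses cb_arbitrage.timestamp, so it appears to assume timestamp is a regular column (for example after cb_arbitrage.reset_index()), whereas in the question it is the index.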
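We first take the cumulative sum of the boolean Series, which gets us close to what we want. To find where to reset, we multiply this cumulative sum by the original boolean Series; since False evaluates as 0, the product drops to zero wherever a streak is broken. The negative jumps in that product mark the breaks. Forward-filling them and adding them back to the cumulative sum subtracts whatever had accumulated before each break, and adding 1 makes every streak start at 1.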
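The intermediate values are easier to follow on a small example. The sketch below uses a hand-made boolean Series (hypothetical values, not the question's data) and splits the answer's one-liner into named steps. The names masked, drops, offset and streak are introduced here purely for illustration.

```python
import pandas as pd

# Hypothetical flags: True where a row is exactly one minute after the previous row.
s = pd.Series([False, True, True, False, True, False, True, True])

s1 = s.cumsum()                          # running count of True values; never resets
masked = s.mul(s1)                       # same count, but zero wherever s is False
drops = masked.diff()                    # negative exactly where a streak has just ended
offset = drops.where(drops < 0).ffill()  # carry the most recent negative jump forward
streak = (offset.add(s1, fill_value=0) + 1).astype(int)

print(streak.tolist())  # [1, 2, 3, 1, 2, 1, 2, 3]
```

Each negative jump equals minus the cumulative count at the break, so adding it to s1 brings the count back to zero there, and the final + 1 shifts every streak to start at 1.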
s = cb_arbitrage.timestamp.diff() == pd.Timedelta('1 minute')
s1 = s.cumsum()
s.mul(s1).diff().where(lambda x: x < 0).ffill().add(s1, fill_value=0) + 1
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 1.0
8 1.0
9 1.0
10 2.0
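A different idiom, not part of the answer above, gives the same counts as integers: label each run of consecutive minutes with an id, then number the rows within each run using groupby and cumcount. This is a minimal sketch that rebuilds the timestamps from the question's expected-result table. The names ts, starts_new_run, run_id and count are assumptions made for the example.

```python
import pandas as pd

# Timestamps copied from the question's expected-result table.
ts = pd.Series(pd.to_datetime([
    "2017-07-07 18:23", "2017-07-07 18:24", "2017-07-07 18:25",
    "2017-07-07 18:26", "2017-07-07 18:27", "2017-07-07 18:28",
    "2017-07-07 18:29", "2017-07-07 18:59", "2017-07-07 19:01",
    "2017-07-07 21:55", "2017-07-07 21:56",
]), name="timestamp")

# True where a row does NOT follow the previous row by exactly one minute.
# The first row compares against NaT, so it also counts as the start of a run.
starts_new_run = ts.diff() != pd.Timedelta(minutes=1)

# Every run start bumps the id, so all rows of one run share the same id.
run_id = starts_new_run.cumsum()

# Number the rows inside each run, starting from 1.
count = ts.groupby(run_id).cumcount() + 1

print(count.tolist())  # [1, 2, 3, 4, 5, 6, 7, 1, 1, 1, 2]
```

If the timestamps live in the index, as in the question, something like cb_arbitrage.index.to_series().diff() could stand in for ts.diff().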
Source: https://stackoverflow.com/questions/46144380/pandas-taking-cumulative-sum-with-reset