How to manipulate data in arrays using pandas (and resetting evaluations)

Question

I've revised the question for clarity and removed artifacts and inconsistencies - please reopen for consideration by the community. One contributor already thinks a solution might be possible with groupby in combination with cummax.

I have a dataframe in which col3 holds the max of the prior value of col3 and the current value of col2. It is computed with a cummax expression recently suggested by Scott Boston (thanks!):

df['col3'] = df['col2'].shift(-1).cummax().shift()

The resulting dataframe is shown below. I have also described the desired logic, which compares col2 to a setpoint held as a float value.

Result of applying cummax:

   col0  col1  col2  col3
0     1   5.0  2.50   NaN
1     2   4.9  2.45  2.45
2     3   5.5  2.75  2.75
3     4   3.5  1.75  2.75
4     5   3.1  1.55  2.75
5     6   4.5  2.25  2.75
6     7   5.5  2.75  2.75
7     8   1.2  0.60  2.75
8     9   5.8  2.90  2.90
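
To see where the numbers in the table above come from, the chained expression can be unpacked one call at a time. This is only an illustrative sketch: the names step1, step2 and step3 are made up for the example and are not part of the original code.

```python
import pandas as pd

col2 = pd.Series([2.5, 2.45, 2.75, 1.75, 1.55, 2.25, 2.75, 0.6, 2.9])

step1 = col2.shift(-1)   # pull each value up one row; the last row becomes NaN
step2 = step1.cummax()   # running maximum; the trailing NaN stays NaN
step3 = step2.shift()    # push everything back down one row; row 0 becomes NaN

# Side-by-side view of each intermediate step
print(pd.DataFrame({"col2": col2, "step1": step1, "step2": step2, "col3": step3}))
```

The net effect is that row 0 is always NaN and every later row holds the running maximum of col2 from row 1 onward, which is why the value never drops back after 2.75 is reached.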

The goal is to flag True whenever the most recent row of col3 is >= the setpoint (2.71 in the example above).

The problem: the cummax solution does not reset when the setpoint is reached. I need a solution that restarts the cummax calculation every time the setpoint is breached. In the table above, the first True occurs when col3 exceeds the setpoint, i.e. when the col2 value is 2.75. There is a second point where the same condition should be satisfied. This is shown in the extended data table further down, where I've deleted col3's value in the fourth row (index 3) to illustrate the need to 'reset' the cummax calculation. In the if statement I use the subscript [-1] to target the last (most recent) row of the df. Note: col2 = current value of col1 * constant1, where constant1 == 0.5.

Code tried so far (note that col3 is not resetting properly):

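import numpy as np
import pandas as pd

# Fragment from inside a class method; self.constant, self.temp and self.log are not shown here.
if self.constant is not None:
    setpoint = self.constant * (1 - self.temp)  # suppose setpoint == 2.71

df = pd.DataFrame({'col0': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'col1': [5, 4.9, 5.5, 3.5, 3.1, 4.5, 5.5, 1.2, 5.8],
                   'col2': [2.5, 2.45, 2.75, 1.75, 1.55, 2.25, 2.75, 0.6, 2.9],
                   'col3': [np.nan, 2.45, 2.75, 2.75, 2.75, 2.75, 2.75, 2.75, 2.9]})

# .iloc[-1] selects the last row by position; df['col3'][-1] would look up the label -1 instead
if df['col3'].iloc[-1] >= setpoint:
    self.log('setpoint hit')
    return True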

The cummax solution needs tweaking: col3 should be evaluated from the values of col2 and col3, and once the setpoint is breached (2.71 for col3), the next col3 value should reset to NaN and a new cummax should start. The correct output for col3 is [NaN, 2.45, 2.75, NaN, 1.55, 2.25, 2.75, NaN, 2.9], and the check should return True each time the last row of col3 breaches the setpoint value of 2.71.

Desired result of applying cummax with an additional tweak to col3 (possibly with a groupby that references col2?), so that True is returned every time the setpoint is breached. Here is one example of the resulting col3:

   col0  col1  col2  col3
0     1   5.0  2.50   NaN
1     2   4.9  2.45  2.45
2     3   5.5  2.75  2.75
3     4   3.5  1.75   NaN
4     5   3.1  1.55  1.55
5     6   4.5  2.25  2.25
6     7   5.5  2.75  2.75
7     8   1.2  0.60   NaN
8     9   5.8  2.90  2.90

I'm open to suggestions on whether NaN should appear on the row where the breach occurs or on the next row, as shown above (the key requirement is that the if statement resolves to True as soon as the setpoint is breached).

Answer 1:

Try:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col0':[1,2,3,4,5,6,7,8,9]
              ,'col1':[5,4.9,5.5,3.5,3.1,4.5,5.5,1.2,5.8]
              ,'col2':[2.5,2.45,2.75,1.75,1.55,2.25,2.75,0.6,2.9]
              ,'col3':[np.nan,2.45,2.75,2.75,2.75,2.75,2.75,2.75,2.9]
              })


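threshold = 2.71

# Group key: running count of threshold breaches, shifted down one row so that
# the breach row stays in the group it closes and the next row starts a new one
grp = df['col2'].ge(threshold).cumsum().shift().bfill()

# Re-run the original shift/cummax/shift chain independently within each group
df['col3'] = df['col2'].groupby(grp).transform(lambda x: x.shift(-1).cummax().shift())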

print(df)

Output:

   col0  col1  col2  col3
0     1   5.0  2.50   NaN
1     2   4.9  2.45  2.45
2     3   5.5  2.75  2.75
3     4   3.5  1.75   NaN
4     5   3.1  1.55  1.55
5     6   4.5  2.25  2.25
6     7   5.5  2.75  2.75
7     8   1.2  0.60   NaN
8     9   5.8  2.90  2.90
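
With col3 rebuilt as in the output above, the last-row check from the question could be written roughly as follows. This is a sketch under assumptions: setpoint_hit is a hypothetical helper name (the question's own code lives in a class method that is not shown), and the setpoint is taken to be the example value 2.71.

```python
import pandas as pd

def setpoint_hit(col3: pd.Series, setpoint: float) -> bool:
    """Return True when the most recent (last) value of col3 is at or above setpoint."""
    if col3.empty:
        return False
    last = col3.iloc[-1]  # positional access to the last row
    # A NaN (freshly reset) last value counts as "not hit"
    return bool(pd.notna(last) and last >= setpoint)

col2 = pd.Series([2.5, 2.45, 2.75, 1.75, 1.55, 2.25, 2.75, 0.6, 2.9])
threshold = 2.71  # example setpoint from the question

grp = col2.ge(threshold).cumsum().shift().bfill()
col3 = col2.groupby(grp).transform(lambda x: x.shift(-1).cummax().shift())

# Check on the most recent row only
print(setpoint_hit(col3, threshold))

# Per-row view of the same condition; NaN rows compare as False
hits = col3.ge(threshold).tolist()
print(hits)
```

Because the running maximum restarts after each breach, the per-row flags are True only on the breach rows themselves (indexes 2, 6 and 8) rather than on every row after the first breach.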

Details:

Create a grouping key from the rows where col2 is greater than or equal to the threshold, then apply the same logic to each group within the dataframe using groupby with transform.

Source: https://stackoverflow.com/questions/59549685/how-to-manipulate-data-in-arrays-using-pandas-and-resetting-evaluations

Tags
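
The grouping key is easier to follow when each intermediate step is printed. The sketch below uses the same data and threshold as the answer; the names breach, counter and steps are introduced here only for illustration.

```python
import pandas as pd

col2 = pd.Series([2.5, 2.45, 2.75, 1.75, 1.55, 2.25, 2.75, 0.6, 2.9])
threshold = 2.71

breach = col2.ge(threshold)    # True on rows where col2 reaches the threshold
counter = breach.cumsum()      # goes up by one on every breach row
grp = counter.shift().bfill()  # delay by one row; bfill fills the NaN created at row 0

steps = pd.DataFrame({"col2": col2, "breach": breach, "counter": counter, "grp": grp})
print(steps)
```

The shift is what keeps a breach row in the group it ends: rows 0-2 form group 0, rows 3-6 form group 1, and rows 7-8 form group 2, so the running maximum restarts on the row after each breach.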

pandas

dataframe

python-3.6