Question
I believe the following function is a working solution for pandas DataFrame rolling argmin/max:
import numpy as np

def data_frame_rolling_arg_func(df, window_size, func):
    # func is 'min' or 'max'; it selects np.argmin or np.argmax.
    ws = window_size
    wm1 = window_size - 1
    return (df.rolling(ws).apply(getattr(np, f'arg{func}'))[wm1:].astype(int) +
            np.array([np.arange(len(df) - wm1)]).T).applymap(
        lambda x: df.index[x]).combine_first(df.applymap(lambda x: np.nan))

It is inspired by a partial solution for rolling idxmax on pandas Series.
Explanations:
- Apply the numpy argmin/max function to the rolling window.
- Only keep the non-NaN values.
- Convert the values to int.
- Realign the values to original row numbers.
- Use applymap to replace the row numbers by the index values.
- Combine with the original DataFrame filled with NaN in order to add the first rows with expected NaN values.
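To make the steps above concrete, here is a minimal sketch that walks through them one at a time on a single Series. The toy data, the variable names, the use of raw=True, and reindex in place of combine_first are assumptions made for illustration; they are not part of the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen only to make the arithmetic easy to follow.
s = pd.Series([4, 9, 2, 7, 5], index=list("abcde"))
window = 3

# Step 1: position of the maximum *inside* each window (0 .. window - 1).
in_window = s.rolling(window).apply(np.argmax, raw=True)

# Steps 2-3: drop the leading NaN rows, then convert to int.
in_window = in_window.iloc[window - 1:].astype(int)

# Step 4: the k-th complete window starts at row k, so adding k turns a
# within-window position into a row number of the whole Series.
row_numbers = in_window + np.arange(len(s) - window + 1)

# Step 5: replace the row numbers by index labels.
labels = row_numbers.map(lambda r: s.index[r])

# Step 6: put back the first window - 1 rows, which become NaN.
result = labels.reindex(s.index)
print(result)
```

The windows here are (4, 9, 2), (9, 2, 7) and (2, 7, 5), so the row numbers come out as 1, 1 and 3, i.e. the labels b, b and d.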
In [1]: index = map(chr, range(ord('a'), ord('a') + 10))
In [2]: df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)
In [3]: df
Out[3]:
    0    1    2
a  -4   15    0
b   0   -6    4
c   7    8  -18
d  11   12  -16
e   6    3   -6
f  -1    4   -9
g   6  -10   -7
h   8   11  -25
i  -2  -10   -8
j   0   10   -7
In [4]: data_frame_rolling_arg_func(df, 3, 'max')
Out[4]:
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    c    a    b
d    d    d    b
e    d    d    e
f    d    d    e
g    e    f    e
h    h    h    g
i    h    h    g
j    h    h    j
In [5]: data_frame_rolling_arg_func(df, 3, 'min')
Out[5]:
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    b    c
d    b    b    c
e    e    e    c
f    f    e    d
g    f    g    f
h    f    g    h
i    i    g    h
j    i    i    h
My questions are:
- Can you find any mistakes?
- Is there a better solution? That is: more performant and/or more elegant.
And for pandas maintainers out there: it would be nice if the already great pandas library included rolling idxmax and idxmin.
Answer 1:
The NaN issue I mentioned in a comment to the OP can be solved in the following manner:
import numpy as np
import pandas as pd

def data_frame_rolling_idx_func(df, window_size, func):
    ws = window_size
    wm1 = window_size - 1
    # min_periods=0 and raw=True let np.argmin/np.argmax see windows containing NaN.
    return (df.rolling(ws, min_periods=0).apply(getattr(np, f'arg{func}'),
                                                raw=True)[wm1:].astype(int) +
            np.array([np.arange(len(df) - wm1)]).T).applymap(
        lambda x: df.index[x]).combine_first(df.applymap(lambda x: np.nan))

def main():
    index = map(chr, range(ord('a'), ord('a') + 10))
    df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)
    df[0] = df[0].astype(float)  # NaN needs a float column
    df.iloc[3:6, 0] = np.nan
    print(df)
    print(data_frame_rolling_idx_func(df, 3, 'min'))
    print(data_frame_rolling_idx_func(df, 3, 'max'))

if __name__ == "__main__":
    main()
Result:
$ python demo.py
      0    1    2
a   3.0    0    7
b   1.0    3   11
c   1.0   15   -6
d   NaN    2  -16
e   NaN    0   24
f   NaN    0   14
g   2.0    0    4
h  -1.0  -11   16
i  17.0    0   -2
j   3.0   -5   -8
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    b    a    c
d    d    d    d
e    d    e    d
f    d    e    d
g    e    e    g
h    f    h    g
i    h    h    i
j    h    h    j
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    c    b
d    d    c    b
e    d    c    e
f    d    d    e
g    e    e    e
h    f    f    h
i    i    g    h
j    i    i    h
The handling of NaN values is a little subtle. I want my rolling idxmin/max function to cooperate well with the regular DataFrame rolling min/max functions. These, by default, will generate a NaN value as soon as the window input shows a NaN value. And so will the rolling apply function by default. But for the apply function, that is a problem, because I will not be able to transform the NaN value into an index. However this is a pity, since the NaN values in the output show up because they can be found in the input, so the NaN value index in the input is what I would like my rolling idxmin/max function to produce. Fortunately, this is exactly what I will get if I use the following combination of parameters:
- min_periods=0 for the pandas rolling function. The apply function will then get a chance to produce its own value regardless of how many NaN values are found in the input window.
- raw=True for the apply function. This parameter ensures that the input to the applied function is passed as a numpy array instead of a pandas Series. np.argmin/max will then return the index of the first input NaN value, which is exactly what we want. It should be noted that without raw=True, i.e. in the pandas Series case, np.argmin/max seems to ignore the NaN values, which is NOT what we want. The nice thing with raw=True is that it should improve performance too! More about that later.
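The behaviour these two parameters rely on can be checked in isolation. The following is a small sketch with made-up values (the names window_values, s and positions are only illustrative): it shows what np.argmin/np.argmax do with a NaN in a raw numpy array, and what a rolling apply with min_periods=0 and raw=True then produces.

```python
import numpy as np
import pandas as pd

# Hypothetical window content with one NaN in the middle.
window_values = np.array([3.0, np.nan, 1.0])

# On a raw numpy array, argmin and argmax both report the first NaN.
print(np.argmin(window_values))     # 1
print(np.argmax(window_values))     # 1
# For contrast, the nan-aware variant skips the NaN.
print(np.nanargmin(window_values))  # 2

s = pd.Series([3.0, np.nan, 1.0, 2.0])
# min_periods=0 lets the applied function run on every window, including
# incomplete ones and ones that contain NaN; raw=True hands it a numpy array.
positions = s.rolling(2, min_periods=0).apply(np.argmin, raw=True)
print(positions)
```

With a window of 2, the second window is (3.0, NaN) and the third is (NaN, 1.0), so the within-window positions point at the NaN in both cases.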
Answer 2:
The solution in my previous answer manages to give proper index values for NaN input values, but I have realized that this is most probably not what a native pandas rolling idxmin/idxmax would do by default. By default, it would produce a NaN value if there are one or more NaN values in the window.
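For reference, this is the default behaviour of the existing rolling min that such a function would be expected to mirror. The sketch uses made-up values and compares it with a rolling apply of np.argmin under the same defaults.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a single NaN at row 1.
s = pd.Series([1.0, np.nan, 3.0, 4.0, 5.0])

# With default min_periods (equal to the window size), every window that
# contains the NaN yields NaN, for the built-in min and for apply alike.
rolling_min = s.rolling(3).min()
rolling_argmin = s.rolling(3).apply(np.argmin, raw=True)

print(pd.DataFrame({"min": rolling_min, "argmin": rolling_argmin}))
```

Only the last window (3.0, 4.0, 5.0) is free of NaN, so it is the only row with a value in either column.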
I came up with a variant of my solution, which does that:
import numpy as np
import pandas as pd

def transform_if_possible(func):
    # Wrap func so that an input for which it raises ValueError is returned unchanged.
    def f(i):
        try:
            return func(i)
        except ValueError:
            return i
    return f

int_if_possible = transform_if_possible(int)

def data_frame_rolling_idx_func(df, window_size, func):
    ws = window_size
    wm1 = window_size - 1
    index_if_possible = transform_if_possible(lambda i: df.index[i])
    return (df.rolling(ws).apply(getattr(np, f'arg{func}'), raw=True).applymap(int_if_possible) +
            np.array([np.arange(len(df)) - wm1]).T).applymap(index_if_possible)

def main():
    print(int_if_possible(1.2))
    print(int_if_possible(np.nan))
    index = map(chr, range(ord('a'), ord('a') + 10))
    df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)
    df[0] = df[0].astype(float)  # NaN needs a float column
    df.iloc[3:6, 0] = np.nan
    print(df)
    print(data_frame_rolling_idx_func(df, 3, 'min'))
    print(data_frame_rolling_idx_func(df, 3, 'max'))

if __name__ == "__main__":
    main()
Results:
1
nan
       0   1    2
a   15.0  -2   13
b   -6.0  -4   -3
c  -12.0  -7   -8
d    NaN   0   -4
e    NaN  -1  -11
f    NaN  -9   10
g   -1.0  24    1
h  -15.0  14  -16
i    7.0  -4   14
j   -1.0   4   10
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    c    c    c
d  NaN    c    c
e  NaN    c    e
f  NaN    f    e
g  NaN    f    e
h  NaN    f    h
i    h    i    h
j    h    i    h
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    a    a
d  NaN    d    b
e  NaN    d    d
f  NaN    d    f
g  NaN    g    f
h  NaN    g    f
i    i    g    i
j    i    h    i
To achieve my goal, I am using two functions to transform values into integers, and row numbers into index values, respectively, which leave NaN unchanged. I construct these functions with the help of a common closure, transform_if_possible. In the second case, since the index transformation is dependent on the DataFrame, I construct the transformation function from a local lambda function.
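The same NaN-preserving lookup can also be written without a per-cell function call. The following is only a sketch of that idea, not the approach used in this answer; positions_to_labels is a hypothetical helper name, and the input frame is made up.

```python
import numpy as np
import pandas as pd

def positions_to_labels(pos: pd.DataFrame, index: pd.Index) -> pd.DataFrame:
    """Map row numbers to index labels, leaving NaN cells as NaN."""
    valid = pos.notna().to_numpy()
    # Use row 0 as a harmless stand-in wherever the position is missing.
    safe = pos.fillna(0).to_numpy().astype(int)
    # Fancy indexing with a 2D integer array returns a 2D array of labels.
    labels = index.to_numpy(dtype=object)[safe]
    # Restore NaN in the cells that had no position.
    labels[~valid] = np.nan
    return pd.DataFrame(labels, index=pos.index, columns=pos.columns)

# Hypothetical row numbers, with NaN where no position is available.
pos = pd.DataFrame([[np.nan, 0.0], [1.0, np.nan], [2.0, 1.0]], index=list("xyz"))
out = positions_to_labels(pos, pd.Index(list("xyz")))
print(out)
```

The trade-off is that the closure version is shorter and works cell by cell, while this version does the whole lookup in one numpy indexing step.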
Apart from these aspects, the solution is similar to my previous one, but since NaN is explicitly handled, I no longer need a special handling of the first window_size - 1 rows, so the code is a little shorter.
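Why the first rows need no special treatment can be seen from the offsets alone. In the sketch below (toy data and variable names are assumptions), the offsets are negative for the first window - 1 rows, but those rows hold NaN after the rolling apply, and NaN plus any offset stays NaN.

```python
import numpy as np
import pandas as pd

window = 3
# Hypothetical single-column frame.
df = pd.DataFrame({"x": [5.0, 1.0, 4.0, 2.0, 3.0]}, index=list("abcde"))

# Within-window positions; the first window - 1 rows are NaN by default.
in_window = df.rolling(window).apply(np.argmin, raw=True)

# One offset per row: the row number at which that row's window starts.
starts = (np.arange(len(df)) - (window - 1)).reshape(-1, 1)
print(starts.ravel())  # [-2 -1  0  1  2]

# NaN + offset is still NaN, so the negative offsets never surface.
row_numbers = in_window + starts
print(row_numbers)
```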
A nice side effect of this solution is that the running time seems to be lower: a little over three times the running time of the corresponding rolling min/max, instead of five times.
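A ratio of this kind can be measured with timeit. The sketch below is only an assumption about how such a measurement might be set up: the frame size, window and repeat count are made up, and it times just the rolling apply step rather than the full function, so its numbers are not comparable to the figures quoted above and will vary with machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark input; sizes are arbitrary.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((2_000, 3)))
window = 20
repeats = 5

# Baseline: the built-in rolling min.
t_min = timeit.timeit(lambda: df.rolling(window).min(), number=repeats)

# Candidate: the rolling apply step that the idxmin/idxmax functions rely on.
t_arg = timeit.timeit(
    lambda: df.rolling(window).apply(np.argmin, raw=True), number=repeats
)

print(f"rolling min:          {t_min:.4f} s")
print(f"rolling apply argmin: {t_arg:.4f} s")
print(f"ratio:                {t_arg / t_min:.1f}x")
```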
All in all, a better solution I think.
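One further direction, given here only as a sketch under stated assumptions and not as part of this answer: with NumPy 1.20 or later, numpy.lib.stride_tricks.sliding_window_view can build all windows at once, so np.argmin/np.argmax run a single time over the whole frame. The function name rolling_idx is hypothetical, and the sketch assumes numeric columns and the NaN-in-window-gives-NaN behaviour described above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def rolling_idx(df: pd.DataFrame, window_size: int, func: str) -> pd.DataFrame:
    """Rolling idxmin/idxmax sketch; func is 'min' or 'max'."""
    values = df.to_numpy(dtype=float)
    n_rows, n_cols = values.shape
    out = np.full((n_rows, n_cols), np.nan, dtype=object)
    if n_rows >= window_size:
        # Shape: (n_rows - window_size + 1, n_cols, window_size).
        windows = sliding_window_view(values, window_size, axis=0)
        in_window = getattr(np, f"arg{func}")(windows, axis=-1)
        # Window k starts at row k, so add k to get whole-frame row numbers.
        rows = in_window + np.arange(len(windows))[:, None]
        labels = df.index.to_numpy(dtype=object)[rows]
        # Match the rolling min/max default: any NaN in the window gives NaN.
        labels[np.isnan(windows).any(axis=-1)] = np.nan
        out[window_size - 1:] = labels
    return pd.DataFrame(out, index=df.index, columns=df.columns)

# Hypothetical input with one NaN at row d.
df = pd.DataFrame({"x": [5.0, 1.0, 4.0, np.nan, 3.0, 2.0, 6.0]},
                  index=list("abcdefg"))
result_min = rolling_idx(df, 3, "min")
result_max = rolling_idx(df, 3, "max")
print(result_min)
print(result_max)
```

Ties resolve to the first occurrence in the window, as with np.argmin/np.argmax. Whether this is faster in practice would need to be measured on real data.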
Source: https://stackoverflow.com/questions/65526535/rolling-idxmin-max-for-pandas-dataframe