Question
I have a data frame and can compute a new column of rolling 10-period means using pandas.stats.moments.rolling_mean(ExistingColumn, 10, min_periods=10). If there are fewer than 10 periods available, I get a NaN. I can do the same for rolling medians. Perfect.
I'd now like to compute other rolling functions over N periods, but can't for the life of me figure out how to use a user-defined function with Pandas. In particular, I want to compute a rolling 10-point Hodges-Lehmann mean, which is defined as follows:
import statistics

def hodgesLehmanMean(x):
    # half the median of all pairwise sums x[i] + x[j] with i < j
    return 0.5 * statistics.median(x[i] + x[j] for i in range(len(x)) for j in range(i+1,len(x)))
How can I turn this into a rolling function that can be applied to a Pandas dataframe and returns NaN if fewer than 10 periods are passed to it? I'm a Pandas newbie, so I'd particularly appreciate a simple explanation with an example.
Answer 1:
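You could use pandas.rolling_apply:
import numpy as np
import pandas as pd

def hodgesLehmanMean(x):
    # half the median of all pairwise sums x[i] + x[j] with i < j
    return 0.5 * np.median([x[i] + x[j]
                            for i in range(len(x))
                            for j in range(i+1,len(x))])

df = pd.DataFrame({'foo': np.arange(20, dtype='float')})
# pd.rolling_apply was removed in later pandas releases; there, use
# df['foo'].rolling(10).apply(hodgesLehmanMean, raw=True) instead
df['bar'] = pd.rolling_apply(df['foo'], 10, hodgesLehmanMean)
print(df)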
yields
    foo   bar
0     0   NaN
1     1   NaN
2     2   NaN
3     3   NaN
4     4   NaN
5     5   NaN
6     6   NaN
7     7   NaN
8     8   NaN
9     9   4.5
10   10   5.5
11   11   6.5
12   12   7.5
13   13   8.5
14   14   9.5
15   15  10.5
16   16  11.5
17   17  12.5
18   18  13.5
19   19  14.5
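In pandas versions where the top-level rolling functions are no longer available, the same computation can be written with the rolling window accessor. The following is a minimal sketch, assuming a pandas version whose Rolling.apply accepts the raw argument; min_periods=10 is spelled out to make the NaN behaviour for short windows explicit:

```python
import numpy as np
import pandas as pd


def hodgesLehmanMean(x):
    # half the median of all pairwise sums x[i] + x[j] with i < j
    return 0.5 * np.median([x[i] + x[j]
                            for i in range(len(x))
                            for j in range(i + 1, len(x))])


df = pd.DataFrame({'foo': np.arange(20, dtype='float')})

# raw=True passes each window as a plain NumPy array, so the positional
# indexing x[i] inside the function works as intended.
# min_periods=10 leaves NaN wherever fewer than 10 values are available.
df['bar'] = df['foo'].rolling(window=10, min_periods=10).apply(
    hodgesLehmanMean, raw=True)
print(df)
```

With raw=False the function would receive a Series rather than an array, and x[i] would then be looked up by index label instead of by position, so raw=True is the safer choice for a function written like this one.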
A faster version of hodgesLehmanMean would be:
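def hodgesLehmanMean_alt(x):
    # m[i, j] = x[i] + x[j] for every pair of indices
    m = np.add.outer(x,x)
    # indices strictly below the diagonal: each unordered pair once
    ind = np.tril_indices(len(x), -1)
    return 0.5 * np.median(m[ind])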
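To see why this matches the loop version, it may help to trace the two NumPy calls on a tiny input. The three-element array below is an arbitrary illustration, not data from the question:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])

# m[i, j] = x[i] + x[j] for every ordered pair (i, j)
m = np.add.outer(x, x)

# Row and column indices strictly below the diagonal (i > j),
# which picks each unordered pair of distinct elements exactly once
ind = np.tril_indices(len(x), -1)

pair_sums = m[ind]
print(pair_sums)                    # sums 2+1, 4+1 and 4+2
print(0.5 * np.median(pair_sums))   # half the median of those sums
```

The loop version enumerates pairs with i < j while this one takes i > j, but since x[i] + x[j] is symmetric the two sets of sums are the same. The trade-off is memory: the outer sum builds a full len(x) by len(x) matrix before selecting the lower triangle.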
Here is a sanity check showing that hodgesLehmanMean_alt returns the same value as hodgesLehmanMean for 1000 random arrays of length 100:
In [68]: m = np.random.random((1000, 100))
In [69]: all(hodgesLehmanMean(x) == hodgesLehmanMean_alt(x) for x in m)
Out[69]: True
Here is a benchmark showing that hodgesLehmanMean_alt is about 8x faster:
In [80]: x = np.random.random(5000)
In [81]: %timeit hodgesLehmanMean(x)
1 loops, best of 3: 3.99 s per loop
In [82]: %timeit hodgesLehmanMean_alt(x)
1 loops, best of 3: 463 ms per loop
Source: https://stackoverflow.com/questions/27990497/pandas-create-a-new-column-in-a-dataframe-that-is-a-function-of-a-rolling-windo