Question
Similar to the question Exponential Decay on Python Pandas DataFrame, I would like to quickly compute exponentially decaying sums for some columns in a data frame. However, the rows in the data frame are not evenly spaced in time. Hence, while

exponential_sum[i] = column_to_sum[i] + np.exp(-const*(time[i]-time[i-1])) * exponential_sum[i-1]

still holds, the weight np.exp(...) does not factor out, and it is not obvious to me how to adapt the answer to that question while still taking advantage of pandas/numpy vectorization. Is there a pandas vectorized solution to this problem?
To illustrate the desired calculation, here is a sample frame with the exponential moving sum of A stored in the Sum column, using a decay constant of 1:
    time  A       Sum
0   1.00  1  1.000000
1   2.10  3  3.332871
2   2.13 -1  2.234370
3   3.70  7  7.464850
4  10.00  2  2.013708
5  10.20  1  2.648684
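For reference, the Sum column above can be produced directly with a plain (non-vectorized) loop implementing the recurrence; this is the computation I would like to vectorize:

import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [1, 2.1, 2.13, 3.7, 10, 10.2],
                   'A': [1, 3, -1, 7, 2, 1]})

# Direct evaluation of the recurrence with decay constant 1.
sums = []
total = 0.0
prev_time = df['time'].iloc[0]
for t, a in zip(df['time'], df['A']):
    total = total * np.exp(-(t - prev_time)) + a
    sums.append(total)
    prev_time = t
df['Sum'] = sums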
Answer 1:
This question is more complicated than it first appeared. I ended up using numba's jit to compile a generator function to calculate the exponential sums. My end result calculates the exponential sum of 5 million rows in under a second on my computer, which hopefully is fast enough for your needs.
import numpy as np
import pandas as pd

# Initial dataframe.
df = pd.DataFrame({'time': [1, 2.1, 2.13, 3.7, 10, 10.2],
                   'A': [1, 3, -1, 7, 2, 1]})

# Initial decay parameter.
decay_constant = 1
We can define the decay weights as exp(-time_delta * decay_constant), and set the initial weight to one:
df['weight'] = np.exp(-df.time.diff() * decay_constant)
df.weight.iat[0] = 1
>>> df
   A   time    weight
0  1   1.00  1.000000
1  3   2.10  0.332871
2 -1   2.13  0.970446
3  7   3.70  0.208045
4  2  10.00  0.001836
5  1  10.20  0.818731
Now we'll use jit from numba to optimize a generator function that calculates the exponential sums:
from numba import jit

@jit(nopython=True)
def exponential_sum(A, k):
    # k[i] is the weight that decays the running total between rows i-1 and i.
    total = A[0]
    yield total
    for i in range(1, len(A)):
        total = total * k[i] + A[i]
        yield total
We'll use the generator to add the values to the dataframe:
df['expSum'] = list(exponential_sum(df.A.values, df.weight.values))
Which produces the desired output:
>>> df
   A   time    weight    expSum
0  1   1.00  1.000000  1.000000
1  3   2.10  0.332871  3.332871
2 -1   2.13  0.970446  2.234370
3  7   3.70  0.208045  7.464850
4  2  10.00  0.001836  2.013708
5  1  10.20  0.818731  2.648684
So let's scale to 5 million rows and check performance:
n = 5_000_000  # np.random.rand needs an int, not the float 5e6
df = pd.DataFrame({'time': np.random.rand(n).cumsum(),
                   'A': np.random.randint(1, 10, n)})
df['weight'] = np.exp(-df.time.diff() * decay_constant)
df.weight.iat[0] = 1
%%timeit -n 10
df['expSum'] = list(exponential_sum(df.A.values, df.weight.values))
10 loops, best of 3: 726 ms per loop
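If you prefer to avoid materializing the generator with list(), the same recurrence can also be written as a jitted function that fills a preallocated array (a variation on the code above, not separately benchmarked):

import numpy as np
from numba import jit

@jit(nopython=True)
def exponential_sum_arr(A, k):
    # Same recurrence as the generator, writing into a preallocated array.
    out = np.empty(len(A))
    out[0] = A[0]
    for i in range(1, len(A)):
        out[i] = out[i - 1] * k[i] + A[i]
    return out

df['expSum'] = exponential_sum_arr(df.A.values, df.weight.values)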
Answer 2:
Expanding on the answer you linked to, I came up with the following method.
First, notice that the recurrence unrolls to:

exponential_sum[i] = column_to_sum[i]
                   + np.exp(-const*(time[i]-time[i-1])) * column_to_sum[i-1]
                   + np.exp(-const*(time[i]-time[i-2])) * column_to_sum[i-2]
                   + ...
So the main change to make is in generating the weightspace to match the formula above. I proceeded like this:
time = pd.Series(np.random.rand(10)).cumsum()
weightspace = np.empty((10, 10))
for i in range(len(time)):
    # Row i holds time[j] - time[i] for every j; after exponentiation, the
    # entries with j <= i are the decay weights exp(-(time[i] - time[j])).
    weightspace[i] = time - time[i]
weightspace = np.exp(weightspace)
Don't worry about the upper-right triangle of the matrix (the entries with j > i); it won't be used. By the way, the loop can be avoided entirely, as sketched below.
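For instance, with NumPy broadcasting (a sketch that produces the same matrix as the loop above):

t = time.values
# Broadcasting: weightspace[i, j] = exp(time[j] - time[i]).
weightspace = np.exp(t[None, :] - t[:, None])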
Then a slight change in how you pick the weights from the weightspace in the rolling function:
def rollingsum(array):
    # Row len(array)-1 holds the weights for the current end point; keep
    # only the first len(array) entries (the j <= i part of the matrix).
    weights = weightspace[len(array) - 1][:len(array)]
    # The dot product of the values and the weights is the decayed sum.
    return np.dot(array, weights)
Works as expected:
dataset = pd.DataFrame(np.random.rand(10, 3), columns=["A", "B", "C"])
# pd.expanding_apply has since been removed; use the expanding() accessor.
a = dataset.expanding().apply(rollingsum, raw=True)
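As a quick sanity check (my own addition, using the sample data from the question rather than random data):

import numpy as np

t = np.array([1.0, 2.1, 2.13, 3.7, 10.0, 10.2])
A = np.array([1, 3, -1, 7, 2, 1], dtype=float)

# weightspace[i, j] = exp(t[j] - t[i]); only the j <= i entries matter.
weightspace = np.exp(t[None, :] - t[:, None])

# Closed form of the expanding sum: lower-triangular weights times values.
expSum = np.tril(weightspace) @ A
print(expSum)  # [1.       3.332871 2.23437  7.46485  2.013708 2.648684]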
Source: https://stackoverflow.com/questions/33294489/pandas-exponentially-decaying-sum-with-variable-weights