问题
I am trying to use a pandas.DataFrame.rolling.apply()
rolling function on multiple columns.
Python version is 3.7, pandas is 1.0.2.
import pandas as pd
#function to calculate
def masscenter(x):
print(x); # for debug purposes
return 0;
#simple DF creation routine
df = pd.DataFrame( [['02:59:47.000282', 87.60, 739],
['03:00:01.042391', 87.51, 10],
['03:00:01.630182', 87.51, 10],
['03:00:01.635150', 88.00, 792],
['03:00:01.914104', 88.00, 10]],
columns=['stamp', 'price','nQty'])
df['stamp'] = pd.to_datetime(df2['stamp'], format='%H:%M:%S.%f')
df.set_index('stamp', inplace=True, drop=True)
'stamp'
is monotonic and unique, 'price'
is double and contains no NaNs, 'nQty'
is integer and also contains no NaNs.
So, I need to calculate rolling 'center of mass', i.e. sum(price*nQty)/sum(nQty)
.
What I tried so far:
df.apply(masscenter, axis = 1)
masscenter
is be called 5 times with a single row and the output will be like
price 87.6
nQty 739.0
Name: 1900-01-01 02:59:47.000282, dtype: float64
It is desired input to a masscenter
, because I can easily access price
and nQty
using x[0], x[1]
. However, I stuck with rolling.apply()
Reading the docs
DataFrame.rolling() and rolling.apply()
I supposed that using 'axis'
in rolling()
and 'raw'
in apply
one achieves similiar behaviour. A naive approach
rol = df.rolling(window=2)
rol.apply(masscenter)
prints row by row (increasing number of rows up to window size)
stamp
1900-01-01 02:59:47.000282 87.60
1900-01-01 03:00:01.042391 87.51
dtype: float64
then
stamp
1900-01-01 02:59:47.000282 739.0
1900-01-01 03:00:01.042391 10.0
dtype: float64
So, columns is passed to masscenter
separately (expected).
Sadly, in the docs there is barely any info about 'axis'
. However the next variant was, obviously
rol = df.rolling(window=2, axis = 1)
rol.apply(masscenter)
Never calls masscenter
and raises ValueError in rol.apply(..)
> Length of passed values is 1, index implies 5
I admit that I'm not sure about 'axis'
parameter and how it works due to lack of documentation. It is the first part of the question:
What is going on here? How to use 'axis' properly? What it is designed for?
Of course, there were answers previously, namely:
How-to-apply-a-function-to-two-columns-of-pandas-dataframe
It works for the whole DataFrame, not Rolling.
How-to-invoke-pandas-rolling-apply-with-parameters-from-multiple-column
The answer suggests to write my own roll function, but the culprit for me is the same as asked in comments: what if one needs to use offset window size (e.g. '1T'
) for non-uniform timestamps?
I don't like the idea to reinvent the wheel from scratch. Also I'd like to use pandas for everything to prevent inconsistency between sets obtained from pandas and 'self-made roll'.
There is another answer to that question, suggessting to populate dataframe separately and calculate whatever I need, but it will not work: the size of stored data will be enormous.
The same idea presented here:
Apply-rolling-function-on-pandas-dataframe-with-multiple-arguments
Another Q & A posted here
Pandas-using-rolling-on-multiple-columns
It is good and the closest to my problem, but again, there is no possibility to use offset window sizes (window = '1T'
).
Some of the answers were asked before pandas 1.0 came out, and given that docs could be much better, I hope it is possible to roll over multiple columns simultaneously now.
The second part of the question is: Is there any possibility to roll over multiple columns simultaneously using pandas 1.0.x with offset window size?
Thank you very much.
回答1:
How about this:
def masscenter(ser):
print(df.loc[ser.index])
return 0
rol = df.price.rolling(window=2)
rol.apply(masscenter, raw=False)
It uses the rolling logic to get subsets from an arbitrary column. The raw=False option provides you with index values for those subsets (which are given to you as Series), then you use those index values to get multi-column slices from your original DataFrame.
回答2:
So I found no way to roll over two columns, however without inbuilt pandas functions. The code is listed below.
# function to find an index corresponding
# to current value minus offset value
def prevInd(series, offset, date):
offset = to_offset(offset)
end_date = date - offset
end = series.index.searchsorted(end_date, side="left")
return end
# function to find an index corresponding
# to the first value greater than current
# it is useful when one has timeseries with non-unique
# but monotonically increasing values
def nextInd(series, date):
end = series.index.searchsorted(date, side="right")
return end
def twoColumnsRoll(dFrame, offset, usecols, fn, columnName = 'twoColRol'):
# find all unique indices
uniqueIndices = dFrame.index.unique()
numOfPoints = len(uniqueIndices)
# prepare an output array
moving = np.zeros(numOfPoints)
# nameholders
price = dFrame[usecols[0]]
qty = dFrame[usecols[1]]
# iterate over unique indices
for ii in range(numOfPoints):
# nameholder
pp = uniqueIndices[ii]
# right index - value greater than current
rInd = afta.nextInd(dFrame,pp)
# left index - the least value that
# is bigger or equal than (pp - offset)
lInd = afta.prevInd(dFrame,offset,pp)
# call the actual calcuating function over two arrays
moving[ii] = fn(price[lInd:rInd], qty[lInd:rInd])
# construct and return DataFrame
return pd.DataFrame(data=moving,index=uniqueIndices,columns=[columnName])
This code works, but it is relatively slow and inefficient. I suppose one can use numpy.lib.stride_tricks from How to invoke pandas.rolling.apply with parameters from multiple column? to speedup things.
However, go big or go home - I ended writing a function in C++ and a wrapper for it.
I'd like not to post it as answer, since it is a workaround and I have not answered neither part of my question, but it is too long for a commentary.
回答3:
You can use rolling_apply function from numpy_ext module:
import numpy as np
import pandas as pd
from numpy_ext import rolling_apply
def masscenter(price, nQty):
return np.sum(price * nQty) / np.sum(nQty)
df = pd.DataFrame( [['02:59:47.000282', 87.60, 739],
['03:00:01.042391', 87.51, 10],
['03:00:01.630182', 87.51, 10],
['03:00:01.635150', 88.00, 792],
['03:00:01.914104', 88.00, 10]],
columns=['stamp', 'price','nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df.set_index('stamp', inplace=True, drop=True)
window = 2
df['y'] = rolling_apply(masscenter, window, df.price.values, df.nQty.values)
print(df)
price nQty y
stamp
1900-01-01 02:59:47.000282 87.60 739 NaN
1900-01-01 03:00:01.042391 87.51 10 87.598798
1900-01-01 03:00:01.630182 87.51 10 87.510000
1900-01-01 03:00:01.635150 88.00 792 87.993890
1900-01-01 03:00:01.914104 88.00 10 88.000000
来源:https://stackoverflow.com/questions/60736556/pandas-rolling-apply-using-multiple-columns