问题
My algorithm stepped up from 35 seconds to 15 minutes runtime when implementing this feature over a daily timeframe. The algo retrieves daily history in bulk and iterates over a subset of the dataframe (from t0 to tX where tX is the current row of iteration). It does this to emulate what would happen during the real time operations of the algo. I know there are ways of improving it by utilizing memory between frame calculations but I was wondering if there was a more pandas-ish implementation that would see immediate benefit.
Assume that self.Step
is something like 0.00001
and self.Precision
is 5
; they are used for binning the ohlc bar information into discrete steps for the sake of finding the poc. _frame
is a subset of rows of the entire dataframe, and _low
/_high
are respective to that. The following block of code executes on the entire _frame
which could be upwards of ~250 rows every time there is a new row added by the algo (when calculating yearly timeframe on daily data). I believe it's the iterrows that's causing the major slowdown. The dataframe has columns such as high
, low
, open
, close
, volume
. I am calculating time price opportunity and volume point of control.
# Set the complete index of prices +/- 1 step due to weird floating point precision issues
volume_prices = pd.Series(0, index=np.around(np.arange(_low - self.Step, _high + self.Step, self.Step), decimals=self.Precision))
time_prices = volume_prices.copy()
for index, state in _frame.iterrows():
_prices = np.around(np.arange(state.low, state.high, self.Step), decimals=self.Precision)
# Evenly distribute the bar's volume over its range
volume_prices[_prices] += state.volume / _prices.size
# Increment time at price
time_prices[_prices] += 1
# Pandas only returns the 1st row of the max value,
# so we need to reverse the series to find the other side
# and then find the average price between those two extremes
volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax()) / 2)
time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax()) / 2)
回答1:
you can use this function as a base and to adjust it:
def f(x): #function to find the POC price and volume
a = x['tradePrice'].value_counts().index[0]
b = x.loc[x['tradePrice'] == a, 'tradeVolume'].sum()
return pd.Series([a,b],['POC_Price','POC_Volume'])
回答2:
Here's what I worked out. I'm still not sure the answer you code is producing is correct, I think your line volume_prices[_prices] += state.Volume / _prices.size
is not being applied to every record in volume_prices, but here it is with benchmarking. About a 9x improvement.
def vpOriginal():
Step = 0.00001
Precision = 5
_frame = getData()
_low = 85.0
_high = 116.4
# Set the complete index of prices +/- 1 step due to weird floating point precision issues
volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
time_prices = volume_prices.copy()
time_prices2 = volume_prices.copy()
for index, state in _frame.iterrows():
_prices = np.around(np.arange(state.Low, state.High, Step), decimals=Precision)
# Evenly distribute the bar's volume over its range
volume_prices[_prices] += state.Volume / _prices.size
# Increment time at price
time_prices[_prices] += 1
time_prices2 += 1
# Pandas only returns the 1st row of the max value,
# so we need to reverse the series to find the other side
# and then find the average price between those two extremes
# print(volume_prices.head(10))
volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
return volume_poc, time_poc
def vpNoDF():
Step = 0.00001
Precision = 5
_frame = getData()
_low = 85.0
_high = 116.4
# Set the complete index of prices +/- 1 step due to weird floating point precision issues
volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
time_prices = volume_prices.copy()
for index, state in _frame.iterrows():
_prices = np.around((state.High - state.Low) / Step , 0)
# Evenly distribute the bar's volume over its range
volume_prices.loc[state.Low:state.High] += state.Volume / _prices
# Increment time at price
time_prices.loc[state.Low:state.High] += 1
# Pandas only returns the 1st row of the max value,
# so we need to reverse the series to find the other side
# and then find the average price between those two extremes
volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
return volume_poc, time_poc
getData()
Out[8]:
Date Open High Low Close Volume Adj Close
0 2008-10-14 116.26 116.40 103.14 104.08 70749800 104.08
1 2008-10-13 104.55 110.53 101.02 110.26 54967000 110.26
2 2008-10-10 85.70 100.00 85.00 96.80 79260700 96.80
3 2008-10-09 93.35 95.80 86.60 88.74 57763700 88.74
4 2008-10-08 85.91 96.33 85.68 89.79 78847900 89.79
5 2008-10-07 100.48 101.50 88.95 89.16 67099000 89.16
6 2008-10-06 91.96 98.78 87.54 98.14 75264900 98.14
7 2008-10-03 104.00 106.50 94.65 97.07 81942800 97.07
8 2008-10-02 108.01 108.79 100.00 100.10 57477300 100.10
9 2008-10-01 111.92 112.36 107.39 109.12 46303000 109.12
vpOriginal()
Out[9]: (142.55000000000001, 142.55000000000001)
vpNoDF()
Out[10]: (142.55000000000001, 142.55000000000001)
%timeit vpOriginal()
2.79 s ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vpNoDF()
300 ms ± 8.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
回答3:
I've managed to get it down to 2 mins instead of 15 - at on daily timeframes anyway. It's still slow on lower timeframes (10 minutes on Hourly over a 2 year period with a precision of 2 for equities). Working with DataFrames as opposed to Series was FAR slower. I'm hoping for more but I don't know what I can do aside from the following solution:
# Upon class instantiation, I've created attributes for each timeframe
# related to `volume_at_price` and `time_at_price`. They serve as memory
# in between frame calculations
def _prices_at(self, frame, bars=0):
# Include 1 step above high as np.arange does not
# include the upper limit by default
state = frame.iloc[-min(bars + 1, frame.index.size)]
bins = np.around(np.arange(state.low, state.high + self.Step, self.Step), decimals=self.Precision)
return pd.Series(state.volume / bins.size, index=bins)
# SetFeature/Feature implement timeframed attributes (i.e., 'volume_at_price_D')
_v = 'volume_at_price'
_t = 'time_at_price'
# Add to x_at_price histogram
_p = self._prices_at(frame)
self.SetFeature(_v, self.Feature(_v).add(_p, fill_value=0))
self.SetFeature(_t, self.Feature(_t).add(_p * 0 + 1, fill_value=0))
# Remove old data from histogram
_p = self._prices_at(frame, self.Bars)
v = self.SetFeature(_v, self.Feature(_v).subtract(_p, fill_value=0))
t = self.SetFeature(_t, self.Feature(_t).subtract(_p * 0 + 1, fill_value=0))
self.SetFeature('volume_poc', (v.idxmax() + v.iloc[::-1].idxmax()) / 2)
self.SetFeature('time_poc', (t.idxmax() + t.iloc[::-1].idxmax()) / 2)
来源:https://stackoverflow.com/questions/60578058/efficiently-calculating-point-of-control-with-pandas