问题
Could someone please point me in the right direction with respect to OHLC data timeframe conversion with Pandas? What I'm trying to do is build a Dataframe with data for higher timeframes, given data with lower timeframe.
For example, given I have the following one-minute (M1) data:
Open High Low Close Volume
Date
1999-01-04 10:22:00 1.1801 1.1819 1.1801 1.1817 4
1999-01-04 10:23:00 1.1817 1.1818 1.1804 1.1814 18
1999-01-04 10:24:00 1.1817 1.1817 1.1802 1.1806 12
1999-01-04 10:25:00 1.1807 1.1815 1.1795 1.1808 26
1999-01-04 10:26:00 1.1803 1.1806 1.1790 1.1806 4
1999-01-04 10:27:00 1.1801 1.1801 1.1779 1.1786 23
1999-01-04 10:28:00 1.1795 1.1801 1.1776 1.1788 28
1999-01-04 10:29:00 1.1793 1.1795 1.1782 1.1789 10
1999-01-04 10:31:00 1.1780 1.1792 1.1776 1.1792 12
1999-01-04 10:32:00 1.1788 1.1792 1.1788 1.1791 4
which has Open, High, Low, Close (OHLC) and volume values for every minute I would like to build a set of 5-minute readings (M5) which would look like so:
Open High Low Close Volume
Date
1999-01-04 10:25:00 1.1807 1.1815 1.1776 1.1789 91
1999-01-04 10:30:00 1.1780 1.1792 1.1776 1.1791 16
So the workflow is that:
- Open is the Open of the first row in the timewindow
- High is the highest High in the timewindow
- Low is the lowest Low
- Close is the last Close
- Volume is simply a sum of Volumes
There are few issues though:
- the data has gaps ( note there is no 10:30:00 row)
- the 5-minute intervals have to start at round time, e.g. M5 starts at 10:25:00 not 10:22:00
- first, incomplete set can be omitted like in this example, or included (so we could have 10:20:00 5-minute entry)
The Pandas documentation on up-down sampling gives an example, but they use mean value as the value of up-sampled row, which won't work here. I have tried using groupby
and agg
but to no avail. For one getting highest High and lowest Low might be not so hard, but I have no idea how to get first Open and last Close.
What I tried is something along the lines of:
grouped = slice.groupby( dr5minute.asof ).agg(
{ 'Low': lambda x : x.min()[ 'Low' ], 'High': lambda x : x.max()[ 'High' ] }
)
but it results in following error, which I don't understand:
In [27]: grouped = slice.groupby( dr5minute.asof ).agg( { 'Low' : lambda x : x.min()[ 'Low' ], 'High' : lambda x : x.max()[ 'High' ] } )
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/work/python/fxcruncher/<ipython-input-27-df50f9522a2f> in <module>()
----> 1 grouped = slice.groupby( dr5minute.asof ).agg( { 'Low' : lambda x : x.min()[ 'Low' ], 'High' : lambda x : x.max()[ 'High' ] } )
/usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in agg(self, func, *args, **kwargs)
242 See docstring for aggregate
243 """
--> 244 return self.aggregate(func, *args, **kwargs)
245
246 def _iterate_slices(self):
/usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, arg, *args, **kwargs)
1153 colg = SeriesGroupBy(obj[col], column=col,
1154 grouper=self.grouper)
-> 1155 result[col] = colg.aggregate(func)
1156
1157 result = DataFrame(result)
/usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, func_or_funcs, *args, **kwargs)
906 return self._python_agg_general(func_or_funcs, *args, **kwargs)
907 except Exception:
--> 908 result = self._aggregate_named(func_or_funcs, *args, **kwargs)
909
910 index = Index(sorted(result), name=self.grouper.names[0])
/usr/lib/python2.7/site-packages/pandas/core/groupby.pyc in _aggregate_named(self, func, *args, **kwargs)
976 grp = self.get_group(name)
977 grp.name = name
--> 978 output = func(grp, *args, **kwargs)
979 if isinstance(output, np.ndarray):
980 raise Exception('Must produce aggregated value')
/work/python/fxcruncher/<ipython-input-27-df50f9522a2f> in <lambda>(x)
----> 1 grouped = slice.groupby( dr5minute.asof ).agg( { 'Low' : lambda x : x.min()[ 'Low' ], 'High' : lambda x : x.max()[ 'High' ] } )
IndexError: invalid index to scalar variable.
So any help on doing that would be greatly appreciated. If the path I chose is not going to work, please suggest other relatively efficient approach (I have millions of rows). Some resources on using Pandas for financial processing would also be nice.
回答1:
Your approach is sound, but fails because each function in the dict-of-functions applied to agg() receives a Series object reflecting the column matched by the key value. Therefore, it's not necessary to filter on column label again. With this, and assuming groupby preserves order, you can slice the Series to extract the first/last element of the Open/Close columns (note: groupby documentation does not claim to preserve order of original data series, but seems to in practice.)
In [50]: df.groupby(dr5minute.asof).agg({'Low': lambda s: s.min(),
'High': lambda s: s.max(),
'Open': lambda s: s[0],
'Close': lambda s: s[-1],
'Volume': lambda s: s.sum()})
Out[50]:
Close High Low Open Volume
key_0
1999-01-04 10:20:00 1.1806 1.1819 1.1801 1.1801 34
1999-01-04 10:25:00 1.1789 1.1815 1.1776 1.1807 91
1999-01-04 10:30:00 1.1791 1.1792 1.1776 1.1780 16
For reference, here is a table to summarize the expected input and output types of an aggregation function based on the groupby object type and how the aggregation function(s) is/are passed to agg().
agg() method agg func agg func agg()
input type accepts returns result
GroupBy Object
SeriesGroupBy function Series value Series
dict-of-funcs Series value DataFrame, columns match dict keys
list-of-funcs Series value DataFrame, columns match func names
DataFrameGroupBy function DataFrame Series/dict/ary DataFrame, columns match original DataFrame
dict-of-funcs Series value DataFrame, columns match dict keys, where dict keys must be columns in original DataFrame
list-of-funcs Series value DataFrame, MultiIndex columns (original cols x func names)
From the above table, if aggregation requires access to more than one column, the only option is to pass a single function to a DataFrameGroupBy object. Therefore, an alternate way to accomplish the original task is to define a function like the following:
def ohlcsum(df):
df = df.sort()
return {
'Open': df['Open'][0],
'High': df['High'].max(),
'Low': df['Low'].min(),
'Close': df['Close'][-1],
'Volume': df['Volume'].sum()
}
and apply agg() with it:
In [30]: df.groupby(dr5minute.asof).agg(ohlcsum)
Out[30]:
Open High Low Close Volume
key_0
1999-01-04 10:20:00 1.1801 1.1819 1.1801 1.1806 34
1999-01-04 10:25:00 1.1807 1.1815 1.1776 1.1789 91
1999-01-04 10:30:00 1.1780 1.1792 1.1776 1.1791 16
Though pandas may offer some cleaner built-in magic in the future, hopefully this explains how to work with today's agg() capabilities.
回答2:
just to be helpful to other users with a more recent version of Pandas, there is a resample method very fast and useful to accomplish the same task:
ohlc_dict = {
'Open':'first',
'High':'max',
'Low':'min',
'Close': 'last',
'Volume': 'sum'
}
df.resample('5T', how=ohlc_dict, closed='left', label='left')
回答3:
bitfinex_klines_pd = bitfinex_klines_pd.resample('4h').agg({
'open': lambda s: s[0],
'high': lambda df: df.max(),
'low': lambda df: df.min(),
'close': lambda df: df[-1],
'volume': lambda df: df.sum()
})
回答4:
Within my main() function I'm receiving streaming bid/ask data. I then do the following:
df = pd.DataFrame([])
for msg_type, msg in response.parts():
if msg_type == "pricing.Price":
sd = StreamingData(datetime.now(),instrument_string(msg),
mid_string(msg),account_api,account_id,
's','5min',balance)
df = df.append(sd.df())
sd.resample(df)
I created a class StreamingData() which takes the provided input (also created some functions to break up the bid/ask data into individual components (bid, ask, mid, instrument, etc.).
The beauty of this is all you have to do is change the 's' and '5min' to whatever timeframes you want. Set it to 'm' and 'D' to get daily prices by the minute.
This is what my StreamingData() looks like:
class StreamingData(object):
def __init__(self, time, instrument, mid, api, _id, xsec, xmin, balance):
self.time = time
self.instrument = instrument
self.mid = mid
self.api = api
self._id = _id
self.xsec = xsec
self.xmin = xmin
self.balance = balance
self.data = self.resample(self.df())
def df(self):
df1 = pd.DataFrame({'Time':[self.time]})
df2 = pd.DataFrame({'Mid':[float(self.mid)]})
df3 = pd.concat([df1,df2],axis=1,join='inner')
df = df3.set_index(['Time'])
df.index = pd.to_datetime(df.index,unit='s')
return df
def resample(self, df):
xx = df.to_period(freq=self.xsec)
openCol = xx.resample(self.xmin).first()
highCol = xx.resample(self.xmin).max()
lowCol = xx.resample(self.xmin).min()
closeCol = xx.resample(self.xmin).last()
self.data = pd.concat([openCol,highCol,lowCol,closeCol],
axis=1,join='inner')
self.data['Open'] = openCol.round(5)
self.data['High'] = highCol.round(5)
self.data['Low'] = lowCol.round(5)
self.data['Close'] = closeCol.round(5)
return self.data
So it takes in the data from StreamingData(), creates a time indexed dataframe within df(), appends it, then sends through to resample(). The prices I calculate are based off of: mid = (bid+ask)/2
来源:https://stackoverflow.com/questions/9944436/converting-ohlc-stock-data-into-a-different-timeframe-with-python-and-pandas