Question
Preamble:
- I am working with L2 tick data.
- The bid and offer sides will not necessarily have the same number of levels.
- The number of levels could range from 0 to 20.
- I want to save the full book to disk every time it is updated.
- I believe I want to use NumPy arrays so that I can use h5py/vaex for offline data processing.
- Ideally I'll be writing (appending) to disk every x updates or on a timer.
If we assume an example book looks like this:
array([datetime.datetime(2017, 11, 6, 14, 57, 8, 532152), # book creation time
array(['20171106-14:57:08.528', '20171106-14:57:08.428'], dtype='<U21'), # quote entry (bid)
array([1.30699, 1.30698]), # quote price (bid)
array([100000., 250000.]), # quote size (bid)
array(['20171106-14:57:08.528'], dtype='<U21'), # quote entry (offer)
array([1.30709]), # quote price (offer)
array([100000.])], # quote size (offer)
dtype=object)
NumPy doesn't like the jaggedness of the array, and whilst I'm happy (enough) to use np.pad
to pad the times/prices/sizes to a length of 20, I don't think I want to be creating an array for the book creation time.
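For concreteness, the padding idea could be sketched roughly as below. The names `MAX_LEVELS` and `pad_side` are hypothetical, and using NaN as the fill value is an assumption (it only works for float fields such as price and size; the timestamp strings would need a different sentinel).

```python
import numpy as np

MAX_LEVELS = 20  # assumed maximum depth per side, taken from the preamble


def pad_side(values, max_levels=MAX_LEVELS):
    """Right-pad a 1-D float array with NaN up to max_levels entries."""
    values = np.asarray(values, dtype=np.float64)
    # Pad nothing on the left and (max_levels - n) slots on the right.
    return np.pad(values, (0, max_levels - values.size), constant_values=np.nan)


# The jagged sides from the example book become fixed-length arrays.
bid_px = pad_side([1.30699, 1.30698])
ask_px = pad_side([1.30709])
empty_side = pad_side([])  # a side with zero levels is all NaN
```

With every side padded to the same length, successive books can be stacked into a regular 2-D array, at the cost of storing mostly empty slots when the book is shallow.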
Could/should I be going about this differently? Ultimately I'll want to do asof joins against a list of trades, hence I'd like a column-store approach. How is everyone else doing this? Are they storing multiple rows, or multiple columns?
EDIT:
I want to be able to do something like:
import h5py

with h5py.File("foo.h5", "w") as f:
    f.create_dataset("book", data=my_np_array)  # a dataset name is required; "book" is a placeholder
and then later perform an asof join between my hdf5 tickdata and a dataframe of trades.
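The kind of asof join meant here can be illustrated with pandas, assuming the book has already been loaded into a DataFrame. The frames, the column names (`bid_px_0`, `ask_px_0`, `trade_px`) and the values are toy data made up for illustration, not the real storage layout.

```python
import pandas as pd

# Hypothetical top-of-book quotes, one row per book update (toy data).
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2017-11-06 14:57:08.428", "2017-11-06 14:57:08.528"]),
    "bid_px_0": [1.30698, 1.30699],
    "ask_px_0": [1.30710, 1.30709],
})

# Hypothetical trades to be matched against the prevailing book (toy data).
trades = pd.DataFrame({
    "time": pd.to_datetime(["2017-11-06 14:57:08.500", "2017-11-06 14:57:08.600"]),
    "trade_px": [1.30700, 1.30705],
})

# Both frames must be sorted by the key. Each trade is paired with the
# most recent quote at or before its timestamp (the default, backward direction).
joined = pd.merge_asof(trades, quotes, on="time")
print(joined)
```

This only works on flat, one-value-per-cell columns, which is part of why the storage layout matters.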
EDIT2:
In KDB the entry would look like:
q)t:([]time:2017.11.06D14:57:08.528;sym:`EURUSD;bid_time:enlist 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428;bid_px:enlist 1.30699, 1.30698;bid_size:enlist 100000. 250000.;ask_time:enlist 2017.11.06T14:57:08.528;ask_px:enlist 1.30709;ask_size:enlist 100000.)
q)t
time                          sym    bid_time                                        bid_px          bid_size      ask_time                ask_px  ask_size
-----------------------------------------------------------------------------------------------------------------------------------------------------------
2017.11.06D14:57:08.528000000 EURUSD 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428 1.30699 1.30698 100000 250000 2017.11.06T14:57:08.528 1.30709 100000
q)first t
time    | 2017.11.06D14:57:08.528000000
sym     | `EURUSD
bid_time| 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428
bid_px  | 1.30699 1.30698
bid_size| 100000 250000f
ask_time| 2017.11.06T14:57:08.528
ask_px  | 1.30709
ask_size| 100000f
EDIT3:
Should I just give up on the idea of a nested column and have 120 columns (20 levels * 6 fields: bid_times, bid_prices, bid_sizes, ask_times, ask_prices, ask_sizes)? That seems excessive, and unwieldy to work with...
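To make the size of that wide layout concrete, here is a small sketch that enumerates the flat column names. The `field_level` naming scheme and the names `FIELDS` and `MAX_LEVELS` are assumptions for illustration only.

```python
# Hypothetical naming scheme: one column per (field, level) pair.
FIELDS = ["bid_time", "bid_px", "bid_size", "ask_time", "ask_px", "ask_size"]
MAX_LEVELS = 20  # assumed maximum depth per side

columns = [f"{field}_{level}" for field in FIELDS for level in range(MAX_LEVELS)]

print(len(columns))   # 6 fields * 20 levels
print(columns[:3])    # first few bid_time columns
```

The book creation time and the symbol would sit alongside these as ordinary scalar columns.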
Source: https://stackoverflow.com/questions/62980473/storing-l2-tick-data-with-python