Storing L2 tick data with Python

Question

Preamble:

I am working with L2 tick data.
The bid/offer sides will not necessarily be balanced in terms of number of levels.
The number of levels could range from 0 to 20.
I want to save the full book to disk every time it is updated.
I believe I want to use a numpy array so that I can use h5py/vaex to perform offline data processing.
Ideally, I'll be writing (appending) to disk every x updates or on a timer.

If we assume an example book looks like this:

array([datetime.datetime(2017, 11, 6, 14, 57, 8, 532152),                       # book creation time
       array(['20171106-14:57:08.528', '20171106-14:57:08.428'], dtype='<U21'), # quote entry (bid)
       array([1.30699, 1.30698]),                                               # quote price (bid)
       array([100000., 250000.]),                                               # quote size (bid)
       array(['20171106-14:57:08.528'], dtype='<U21'),                          # quote entry (offer)
       array([1.30709]),                                                        # quote price (offer)
       array([100000.])],                                                       # quote size (offer)
       dtype=object)

Numpy doesn't like the jaggedness of the array, and whilst I'm happy (enough) to use np.pad to pad the times/prices/sizes to a length of 20, I don't think I want to be creating an array for the book creation time.
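A minimal sketch of what the padded layout could look like, assuming a numpy structured dtype in which the book creation time stays a scalar field and only the per-level fields are fixed-width arrays of length 20. The names `book_dtype`, `to_ms`, `make_record`, `n_bid` and `n_ask` are hypothetical and not part of any library; unused levels are filled with NaT/NaN rather than zeros so that padding cannot be mistaken for real quotes.

```python
import datetime as dt
import numpy as np

MAX_LEVELS = 20  # maximum depth per side, as stated in the question

# Hypothetical layout: scalar book time, level counts, fixed-width per-level fields.
book_dtype = np.dtype([
    ("time", "datetime64[us]"),
    ("n_bid", "u1"),
    ("n_ask", "u1"),
    ("bid_time", "datetime64[ms]", (MAX_LEVELS,)),
    ("bid_px", "f8", (MAX_LEVELS,)),
    ("bid_size", "f8", (MAX_LEVELS,)),
    ("ask_time", "datetime64[ms]", (MAX_LEVELS,)),
    ("ask_px", "f8", (MAX_LEVELS,)),
    ("ask_size", "f8", (MAX_LEVELS,)),
])


def to_ms(stamps):
    """Convert 'YYYYMMDD-HH:MM:SS.mmm' strings to a datetime64[ms] array."""
    iso = [f"{s[:4]}-{s[4:6]}-{s[6:8]}T{s[9:]}" for s in stamps]
    return np.array(iso, dtype="datetime64[ms]")


def make_record(time, bid_time, bid_px, bid_size, ask_time, ask_px, ask_size):
    """Build a one-element structured array holding a single padded book."""
    rec = np.zeros(1, dtype=book_dtype)
    # Mark unused levels explicitly: NaT for times, NaN for prices and sizes.
    for name in ("bid_time", "ask_time"):
        rec[name] = np.datetime64("NaT")
    for name in ("bid_px", "bid_size", "ask_px", "ask_size"):
        rec[name] = np.nan
    rec["time"] = np.datetime64(time)
    n_bid, n_ask = len(bid_px), len(ask_px)
    rec["n_bid"], rec["n_ask"] = n_bid, n_ask
    rec["bid_time"][0, :n_bid] = to_ms(bid_time)
    rec["bid_px"][0, :n_bid] = bid_px
    rec["bid_size"][0, :n_bid] = bid_size
    rec["ask_time"][0, :n_ask] = to_ms(ask_time)
    rec["ask_px"][0, :n_ask] = ask_px
    rec["ask_size"][0, :n_ask] = ask_size
    return rec


# The example book from the question.
rec = make_record(
    dt.datetime(2017, 11, 6, 14, 57, 8, 532152),
    ["20171106-14:57:08.528", "20171106-14:57:08.428"],
    [1.30699, 1.30698],
    [100000.0, 250000.0],
    ["20171106-14:57:08.528"],
    [1.30709],
    [100000.0],
)
```

Records built this way can be concatenated with np.concatenate into one homogeneous array. As far as I know, h5py has no native mapping for datetime64 fields, so the time fields would likely need to be stored as int64 before writing.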

Could/should I be going about this differently? Ultimately I'll want to do asof joins against a list of trades, hence I'd like a column-store approach. How is everyone else doing this? Are they storing multiple rows, or multiple columns?
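For comparison, the "multiple rows" option could look roughly like the sketch below: one row per price level, with the book creation time repeated on every row. This is only an illustration under assumptions; `level_dtype` and `explode_book` are hypothetical names, and the quote entry times are left out to keep it short.

```python
import numpy as np

# Hypothetical long-format layout: one row per (book, side, level).
level_dtype = np.dtype([
    ("book_time", "datetime64[us]"),
    ("side", "S1"),   # b"B" for bid, b"A" for ask
    ("level", "u1"),  # 0 is the best price on that side
    ("px", "f8"),
    ("size", "f8"),
])


def explode_book(book_time, bid_px, bid_size, ask_px, ask_size):
    """Turn one jagged book into a flat array with one row per level."""
    rows = [(book_time, b"B", i, p, s)
            for i, (p, s) in enumerate(zip(bid_px, bid_size))]
    rows += [(book_time, b"A", i, p, s)
             for i, (p, s) in enumerate(zip(ask_px, ask_size))]
    return np.array(rows, dtype=level_dtype)


# The example book from the question (entry times omitted).
rows = explode_book(
    np.datetime64("2017-11-06T14:57:08.532152"),
    [1.30699, 1.30698], [100000.0, 250000.0],
    [1.30709], [100000.0],
)
```

This keeps every column a plain scalar type and needs no padding, at the cost of repeating the book time and of having to regroup rows to rebuild a full book.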

EDIT:

I want to be able to do something like:

import h5py

with h5py.File("foo.h5", "w") as f:
    f.create_dataset("book", data=my_np_array)  # a dataset name is required; "book" is a placeholder

and then later perform an asof join between my hdf5 tickdata and a dataframe of trades.
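To make the intended asof join concrete, here is a small sketch using pandas.merge_asof, assuming the tick data has already been loaded back into a dataframe with one row per book update. The frames `quotes` and `trades` and the trade prices in them are made-up illustration data, not taken from a real feed.

```python
import pandas as pd

# Hypothetical top-of-book quotes, as they might look after loading from HDF5.
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2017-11-06 14:57:08.428", "2017-11-06 14:57:08.528"]),
    "bid_px": [1.30698, 1.30699],
    "ask_px": [1.30710, 1.30709],
})

# Hypothetical trades to enrich with the prevailing quote.
trades = pd.DataFrame({
    "time": pd.to_datetime(["2017-11-06 14:57:08.500", "2017-11-06 14:57:08.600"]),
    "trade_px": [1.30705, 1.30707],
})

# Both frames must be sorted by the join key. direction="backward" picks,
# for each trade, the last quote at or before the trade time.
joined = pd.merge_asof(trades, quotes, on="time", direction="backward")
```

Each trade row then carries the bid/ask columns of the most recent book at or before its timestamp.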

EDIT2:

In KDB the entry would look like:

q)t:([]time:2017.11.06D14:57:08.528;sym:`EURUSD;bid_time:enlist 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428;bid_px:enlist 1.30699, 1.30698;bid_size:enlist 100000. 250000.;ask_time:enlist 2017.11.06T14:57:08.528;ask_px:enlist 1.30709;ask_size:enlist 100000.)
q)t
time                          sym    bid_time                                        bid_px          bid_size      ask_time                ask_px  ask_size
-----------------------------------------------------------------------------------------------------------------------------------------------------------
2017.11.06D14:57:08.528000000 EURUSD 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428 1.30699 1.30698 100000 250000 2017.11.06T14:57:08.528 1.30709 100000  
q)first t
time    | 2017.11.06D14:57:08.528000000
sym     | `EURUSD
bid_time| 2017.11.06T14:57:08.528 2017.11.06T14:57:08.428
bid_px  | 1.30699 1.30698
bid_size| 100000 250000f
ask_time| 2017.11.06T14:57:08.528
ask_px  | 1.30709
ask_size| 100000f

EDIT3:

Should I just give up on the idea of a nested column and have 120 columns (20 levels * (bid_times + bid_prices + bid_sizes + ask_times + ask_prices + ask_sizes))? That seems excessive, and unwieldy to work with...
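For reference, the 120-column layout could be generated rather than written by hand. The sketch below is only an illustration; the `field_level` naming scheme and the helper `flatten_book` are hypothetical, and missing levels are filled with None.

```python
MAX_LEVELS = 20
FIELDS = ("bid_time", "bid_px", "bid_size", "ask_time", "ask_px", "ask_size")

# Column names such as bid_px_0 ... bid_px_19: 6 fields x 20 levels = 120 columns.
WIDE_COLUMNS = [f"{field}_{level}" for field in FIELDS for level in range(MAX_LEVELS)]


def flatten_book(book):
    """Map a dict of per-side lists to one flat row; missing levels become None."""
    row = {"time": book["time"]}
    for field in FIELDS:
        values = list(book.get(field, []))
        for level in range(MAX_LEVELS):
            row[f"{field}_{level}"] = values[level] if level < len(values) else None
    return row


# The example book from the question, with the book time as a plain string.
row = flatten_book({
    "time": "2017-11-06T14:57:08.532152",
    "bid_time": ["20171106-14:57:08.528", "20171106-14:57:08.428"],
    "bid_px": [1.30699, 1.30698],
    "bid_size": [100000.0, 250000.0],
    "ask_time": ["20171106-14:57:08.528"],
    "ask_px": [1.30709],
    "ask_size": [100000.0],
})
```

A list of such rows can be passed straight to a dataframe constructor, which gives one scalar column per field and level.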

Source: https://stackoverflow.com/questions/62980473/storing-l2-tick-data-with-python

Tags

python

numpy