I have the following data (18,619,211 rows) stored as a pandas DataFrame in an HDF5 file:
date id2 w
id
100
Here are the docs for querying on non-index columns.
Create the test data. It is not clear how the original frame is constructed (e.g. whether the data is unique, and what the ranges are), so I have created a sample with 10M rows and a MultiIndex made of the id column and a date range.
In [60]: np.random.seed(1234)
In [62]: pd.set_option('display.max_rows',20)
In [63]: index = pd.MultiIndex.from_product([np.arange(10000,11000),pd.date_range('19800101',periods=10000)],names=['id','date'])
In [67]: df = pd.DataFrame(dict(id2=np.random.randint(0,1000,size=len(index)),w=np.random.randn(len(index))),index=index).reset_index().set_index(['id','date'])
In [68]: df
Out[68]:
id2 w
id date
10000 1980-01-01 712 0.371372
1980-01-02 718 -1.255708
1980-01-03 581 -1.182727
1980-01-04 202 -0.947432
1980-01-05 493 -0.125346
1980-01-06 752 0.380210
1980-01-07 435 -0.444139
1980-01-08 128 -1.885230
1980-01-09 425 1.603619
1980-01-10 449 0.103737
... ... ...
10999 2007-05-09 8 0.624532
2007-05-10 669 0.268340
2007-05-11 918 0.134816
2007-05-12 979 -0.769406
2007-05-13 969 -0.242123
2007-05-14 950 -0.347884
2007-05-15 49 -1.284825
2007-05-16 922 -1.313928
2007-05-17 347 -0.521352
2007-05-18 353 0.189717
[10000000 rows x 2 columns]
Write the data to disk, showing how to create a data column (note that the indexes are automatically queryable; declaring id2 as a data column makes it queryable as well). to_hdf takes care of opening and closing the store; it is de facto equivalent to opening a store, appending, and closing it yourself.
In order to query a column, it MUST BE A DATA COLUMN or an index of the frame.
In [70]: df.to_hdf('test.h5','df',mode='w',data_columns=['id2'],format='table')
In [71]: !ls -ltr test.h5
-rw-rw-r-- 1 jreback users 430540284 May 26 17:16 test.h5
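As a rough sketch of the open/append/close route mentioned above: the tiny frame, the temporary file path, and the literal query bounds are made up for illustration, and PyTables is assumed to be installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small stand-in for the real frame: 3 ids x 4 dates (illustrative only).
index = pd.MultiIndex.from_product(
    [np.arange(3), pd.date_range("1980-01-01", periods=4)],
    names=["id", "date"],
)
small = pd.DataFrame(
    {"id2": np.arange(len(index)), "w": np.random.randn(len(index))},
    index=index,
)

# Hypothetical scratch location, not the test.h5 used in the session above.
path = os.path.join(tempfile.mkdtemp(), "sketch.h5")

# Open, append, close. append() writes the queryable table format,
# and data_columns makes id2 usable in a where clause.
with pd.HDFStore(path, mode="w") as store:
    store.append("df", small, data_columns=["id2"])
    result = store.select("df", where="id2 > 5 & id2 < 9")
```

Here `result` holds only the rows whose id2 falls strictly between 5 and 9; the filtering happens on disk, not in memory.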
Queries
In [80]: ids=[10101,10898]
In [81]: start_date='20010101'
In [82]: end_date='20010301'
You can specify dates as strings (either in-line or as variables; you can also specify Timestamp-like objects)
In [83]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date')
You can use in-line lists
In [84]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date & id=ids')
You can also specify boolean expressions
In [85]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date & id=ids & id2>500 & id2<600')
Out[85]:
id2 w
id date
10101 2001-01-12 534 -0.220692
2001-01-14 596 -2.225393
2001-01-16 596 0.956239
2001-01-30 513 -2.528996
2001-02-01 572 -1.877398
2001-02-13 569 -0.940748
2001-02-14 541 1.035619
2001-02-21 571 -0.116547
10898 2001-01-16 591 0.082564
2001-02-06 586 0.470872
2001-02-10 531 -0.536194
2001-02-16 586 0.949947
2001-02-19 530 -1.031167
2001-02-22 540 -1.827251
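For the Timestamp-like case mentioned earlier, here is a minimal sketch on a tiny made-up frame. The names `first_day`, `last_day`, and `wanted_ids` and the temporary path are illustrative assumptions, and PyTables is assumed to be installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny stand-in for the 10M-row frame: 3 ids x 5 days.
index = pd.MultiIndex.from_product(
    [[10, 11, 12], pd.date_range("2001-01-01", periods=5)],
    names=["id", "date"],
)
frame = pd.DataFrame({"id2": np.arange(len(index))}, index=index)

path = os.path.join(tempfile.mkdtemp(), "where_sketch.h5")
frame.to_hdf(path, key="df", mode="w", format="table", data_columns=["id2"])

# Variables named in the where string are looked up in the calling scope.
first_day = pd.Timestamp("2001-01-02")
last_day = pd.Timestamp("2001-01-04")
wanted_ids = [10, 12]

subset = pd.read_hdf(
    path, "df", where="date >= first_day & date <= last_day & id = wanted_ids"
)
```

With a list on the right-hand side, `id = wanted_ids` behaves like a membership test, so `subset` contains three days for each of the two requested ids.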
To answer your actual question, I would do this (there is really not enough information, but I'll make some reasonable assumptions):
For example, say that you have 1000 unique ids with 10000 dates each, as my example demonstrates, and you want to select 200 of these ids over a range of 1000 dates.
So in this case I would simply select on the dates then do the in-memory comparison, something like this:
df = pd.read_hdf('test.h5','df',where='date>=global_start_date & date<=global_end_date')
df = df[df.index.get_level_values('id').isin(list_of_ids)]
Your dates might also differ per id. In that case, chunk the ids and query with a list of ids each time.
Something like this:
output = []
# walk the ids in chunks of 30
for i in range(0, len(list_of_ids), 30):
    ids = list_of_ids[i:i+30]
    # placeholder helpers: earliest start / latest end date over this chunk of ids
    start_date = get_start_date_for_these_ids(ids)
    end_date = get_end_date_for_these_ids(ids)
    where = 'id=ids & date>=start_date & date<=end_date'
    df = pd.read_hdf('test.h5','df',where=where)
    output.append(df)
final_result = pd.concat(output)
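When the dates differ per id, each chunk read this way is still a superset: it spans the widest date window among the ids in the chunk. A possible in-memory follow-up, sketched under the assumption that the per-id windows live in a small lookup frame (the `windows` frame and the stand-in `chunk` below are invented for illustration), is to trim every id to its own window:

```python
import pandas as pd

# Hypothetical per-id date windows, indexed by id.
windows = pd.DataFrame(
    {
        "start": pd.to_datetime(["2001-01-02", "2001-01-01"]),
        "end": pd.to_datetime(["2001-01-03", "2001-01-02"]),
    },
    index=pd.Index([10, 11], name="id"),
)

# Stand-in for one chunk read from disk: 2 ids x 4 days.
index = pd.MultiIndex.from_product(
    [[10, 11], pd.date_range("2001-01-01", periods=4)],
    names=["id", "date"],
)
chunk = pd.DataFrame({"w": range(len(index))}, index=index)

# Look up each row's own window, then keep only rows inside it.
ids = chunk.index.get_level_values("id")
dates = chunk.index.get_level_values("date")
starts = windows["start"].reindex(ids).to_numpy()
ends = windows["end"].reindex(ids).to_numpy()
keep = (dates >= starts) & (dates <= ends)
trimmed = chunk[keep]
```

This keeps the number of disk queries low (one per chunk of ids) and pushes the exact per-id cut into a single vectorized comparison.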
The basic idea, then, is to select a superset of the data using the criteria you want, sub-selecting so that each piece fits in memory, while limiting the number of queries you issue (e.g. if each query ends up selecting a single row and you have to run it 18M times, that is bad).