I have a large (~160 million rows) dataframe that I've stored to disk with something like this:
def fillStore(store, tablename):
    files = glob.glob(
Here are some recommendations, and a similar question is here.
Use compression: see here. You should try this (it could make things faster or slower depending on exactly what you are querying), YMMV.
ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5
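If you are rewriting the store anyway, you can also pass the compression options straight to pandas when you write it, instead of repacking afterwards. A minimal sketch (the frame, file name and key here are made up for illustration; substitute your own):

import numpy as np
import pandas as pd

# toy frame standing in for your real data
df = pd.DataFrame({'c_id': np.random.randint(0, 10, 1000),
                   'f_id': np.random.randint(0, 10, 1000),
                   'value': np.random.randn(1000)},
                  index=pd.date_range('20130101', periods=1000, freq='s'))

# write the table compressed in one go (an alternative to repacking with ptrepack)
df.to_hdf('out.h5', 'df', mode='w', format='table',
          data_columns=['c_id', 'f_id'], complevel=9, complib='blosc')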
Use a hierarchical query in chunks. What I mean is this: since you have a relatively small number of c_id and f_id that you care about, structure a single query something like this. This is kind of like using isin.
import pandas as pd

# 'store' is the already-open HDFStore that holds the 'df' table

f_ids = [...]  # the list of f_ids that I care about
c_ids = [...]  # the list of c_ids that I care about

def create_batches(l, maxn=32):
    """ create a list of batches, maxed at maxn """
    batches = []
    while True:
        if len(l) <= maxn:
            if len(l) > 0:
                batches.append(l)
            break
        batches.append(l[0:maxn])
        l = l[maxn:]
    return batches

def f(x):
    # you will need to filter out the min/max timestamps here (which I gather
    # are somewhat dependent on the f_id/c_id group)

    #### process the data and return something
    # for simple stats this could be as simple as:
    return x.describe()

results = []
for f_id_batch in create_batches(f_ids):
    for c_id_batch in create_batches(c_ids):

        q = "f_id={f_id} & c_id={c_id}".format(
                f_id=f_id_batch,
                c_id=c_id_batch)

        # you can include the max/min times in here as well (they would be the max/min
        # times for ALL the included batches though, maybe easy for you to compute)

        result = store.select('df', where=q)

        # sub-process this result
        results.append(result.groupby(['f_id', 'c_id']).apply(f))

results = pd.concat(results)
The key here is to batch things so that the isin does NOT have more than 32 members for any variable that you are querying on. This is an internal numpy/pytables limitation.
If you exceed it, the query will still work, but it will drop that variable and do a reindex
on ALL the data (which is NOT what you want here).
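To make that concrete, here is a small hypothetical illustration of what create_batches (defined above) produces and the kind of where-string each batch turns into:

# e.g. 70 ids -> batches of 32, 32 and 6, each safely under the 32-member limit
ids = list(range(70))
batches = create_batches(ids)
print([len(b) for b in batches])    # [32, 32, 6]

# a sample where-string for one (f_id batch, c_id batch) pair
q = "f_id={f_id} & c_id={c_id}".format(f_id=batches[2], c_id=batches[2])
print(q)                            # f_id=[64, 65, ..., 69] & c_id=[64, 65, ..., 69]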
This way you will have a nice subset of data in memory over just a few loops. These queries I think would take about the same time as most of your queries or so, but you will have way fewer.
The query time is roughly constant for a given subset (unless the data is ordered such that it is completely indexed).
So the query scans 'blocks' of data (which is what the indexes point to). If you have lots of hits across many blocks then the query is slower.
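If you want to see what the query has to work with, you can inspect the storer for the table; a rough sketch, assuming the test2.h5 / 'df' store created in the example below:

import pandas as pd

with pd.HDFStore('test2.h5') as store:
    # the storer repr shows nrows, the indexers and the data columns (dc->[c_id,f_id])
    print(store.get_storer('df'))

    # if needed you can also (re)build a full index on the data columns,
    # e.g. after appending without indexing
    store.create_table_index('df', columns=['c_id', 'f_id'], kind='full')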
Here's an example
In [5]: N = 100000000
In [6]: df = DataFrame(np.random.randn(N,3),columns=['A','B','C'])
In [7]: df['c_id'] = np.random.randint(0,10,size=N)
In [8]: df['f_id'] = np.random.randint(0,10,size=N)
In [9]: df.index = date_range('20130101',periods=N,freq='s')
In [10]: df.to_hdf('test2.h5','df',mode='w',data_columns=['c_id','f_id'])
In [11]: df.head()
Out[11]:
A B C c_id f_id
2013-01-01 00:00:00 0.037287 1.153534 0.639669 8 7
2013-01-01 00:00:01 1.741046 0.459821 0.194282 8 3
2013-01-01 00:00:02 -2.273919 -0.141789 0.770567 1 1
2013-01-01 00:00:03 0.320879 -0.108426 -1.310302 8 6
2013-01-01 00:00:04 -1.445810 -0.777090 -0.148362 5 5
2013-01-01 00:00:05 1.608211 0.069196 0.025021 3 6
2013-01-01 00:00:06 -0.561690 0.613579 1.071438 8 2
2013-01-01 00:00:07 1.795043 -0.661966 1.210714 0 0
2013-01-01 00:00:08 0.176347 -0.461176 1.624514 3 6
2013-01-01 00:00:09 -1.084537 1.941610 -1.423559 9 1
2013-01-01 00:00:10 -0.101036 0.925010 -0.809951 0 9
2013-01-01 00:00:11 -1.185520 0.968519 2.871983 7 5
2013-01-01 00:00:12 -1.089267 -0.333969 -0.665014 3 6
2013-01-01 00:00:13 0.544427 0.130439 0.423749 5 7
2013-01-01 00:00:14 0.112216 0.404801 -0.061730 5 4
2013-01-01 00:00:15 -1.349838 -0.639435 0.993495 0 9
In [2]: %timeit pd.read_hdf('test2.h5','df',where="f_id=[1] & c_id=[2]")
1 loops, best of 3: 13.9 s per loop
In [3]: %timeit pd.read_hdf('test2.h5','df',where="f_id=[1,2] & c_id=[1,2]")
1 loops, best of 3: 21.2 s per loop
In [4]: %timeit pd.read_hdf('test2.h5','df',where="f_id=[1,2,3] & c_id=[1,2,3]")
1 loops, best of 3: 42.8 s per loop
This particular example is 5GB uncompressed and 2.9GB compressed. These results are on the compressed data. In THIS case it is actually quite a bit faster to use the uncompressed file (e.g. the first loop took 3.5s). This is 100MM rows.
So using the last example (4) you are getting 9x the data of the first in a little over 3x the query time.
However your speedup should be MUCH more, because you won't be selecting on individual timestamps, but rather doing that later.
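For the "doing that later" part, one way to fold the per-group timestamp filtering into f (instead of the simple describe above) is something like this sketch; time_bounds is a hypothetical (f_id, c_id) -> (start, stop) mapping that you would supply yourself:

import pandas as pd

# hypothetical per-group time windows; replace with your real bounds
time_bounds = {
    (1, 2): (pd.Timestamp('2013-01-01'), pd.Timestamp('2013-06-30')),
}

def f(x):
    # x is the sub-frame for one (f_id, c_id) group; x.name is that group's key
    start, stop = time_bounds[x.name]
    x = x.loc[start:stop]      # trim to this group's window (DatetimeIndex slice)
    return x.describe()        # or whatever per-group stats you actually need

Used with result.groupby(['f_id','c_id']).apply(f) as above, each group sees only its own time window.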
This whole approach assumes that you have enough main memory to hold your results at these batch sizes (i.e. each batch query selects a relatively small part of the set).
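If even a single batch result is too big to hold comfortably, you can also ask for the selection back in chunks (store.select and read_hdf both accept a chunksize and give you an iterator). A rough sketch on the example store:

import pandas as pd

results = []
with pd.HDFStore('test2.h5') as store:
    # iterate the selection ~2MM rows at a time instead of materializing it all at once
    for chunk in store.select('df', where="f_id=[1,2] & c_id=[1,2]", chunksize=2000000):
        # process each chunk as it arrives (here: per-chunk group stats) and keep the pieces
        results.append(chunk.groupby(['f_id', 'c_id']).apply(lambda x: x.describe()))

out = pd.concat(results)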