Improve Query Performance From a Large HDFStore Table with Pandas

Asked by 说谎 on 2021-02-03 15:13

I have a large (~160 million rows) dataframe that I've stored to disk with something like this:

    def fillStore(store, tablename):
        files = glob.glob(         


        
1 Answer
  • 2021-02-03 16:05

    Here are some recommendations; a similar question is here.

    Use compression: see here. You should try this (it could make your queries faster or slower depending on exactly what you are querying), YMMV.

    ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5
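
    Alternatively, if you are regenerating the store anyway, you can write it compressed directly from pandas instead of repacking afterwards. A minimal sketch, assuming the same data_columns as in the example further down (the output filename is just for illustration):

    # write the table compressed at creation time; complevel/complib mirror
    # the ptrepack options above
    df.to_hdf('out.h5', 'df', mode='w', format='table',
              complevel=9, complib='blosc',
              data_columns=['c_id', 'f_id'])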
    

    Use a hierarchical query in chunks. What I mean is this: since you have a relatively small number of c_id and f_id values that you care about, structure your queries like the following. This is kind of like using isin.

    f_id_list = [...]  # the f_ids that I care about
    c_id_list = [...]  # the c_ids that I care about
    
    def create_batches(l, maxn=32):
        """ create a list of batches, maxed at maxn """
        batches = []
        while True:
            if len(l) <= maxn:
                if len(l) > 0:
                    batches.append(l)
                break
            batches.append(l[0:maxn])
            l = l[maxn:]
        return batches
    
    
    def f(x):
        # you will need to filter out the min/max timestamps here (which I gather
        # are somewhat dependent on the f_id/c_id group)

        # process the data and return something;
        # you could do something like ``return x.describe()`` for simple stats
        return x.describe()


    results = []
    for f_id_batch in create_batches(f_id_list):

        for c_id_batch in create_batches(c_id_list):

            q = "f_id={f_id} & c_id={c_id}".format(
                    f_id=f_id_batch,
                    c_id=c_id_batch)

            # you can include the max/min times in here as well (they would be the
            # max/min times for ALL the included batches though, which may be easy
            # for you to compute)

            result = store.select('df', where=q)

            # sub-process this result, then aggregate per group
            results.append(result.groupby(['f_id', 'c_id']).apply(f))
    
    results = pd.concat(results)
    

    The key here is to batch so that the isin does NOT have more than 32 members for any variable that you are querying on. This is an internal numpy/pytables limitation. If you exceed this, the query will work, but it will drop that variable and do a reindex on ALL the data (which is NOT what you want here).
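
    For example, here is how create_batches splits a list so that no single batch exceeds the limit (the id values and the small maxn are made up purely for illustration; in practice you would keep the default maxn=32):

    f_id_list = [10, 11, 12, 13, 14, 15, 16]
    create_batches(f_id_list, maxn=3)
    # -> [[10, 11, 12], [13, 14, 15], [16]]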

    This way you will have a nice subset of the data in memory after just a few loops. Each of these queries should take about as long as one of your current queries, but you will need far fewer of them.

    The query time is roughly constant for a given subset (unless the data is ordered such that it is completely indexed).

    So the query scans 'blocks' of data (which is what the indexes point to). If your hits are spread across many blocks, then the query is slower.
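
    If you want to see what those blocks look like for your table, something like this should work on the test2.h5 file from the example below (get_storer is a pandas HDFStore method and chunkshape is a PyTables attribute; the exact numbers depend on how the table was written):

    # peek at the on-disk layout: nrows is the total row count, and chunkshape
    # is the number of rows per on-disk chunk (the 'blocks' a query reads)
    with pd.HDFStore('test2.h5') as store:
        storer = store.get_storer('df')
        print(storer.nrows)
        print(storer.table.chunkshape)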

    Here's an example

    In [5]: N = 100000000
    
    In [6]: df = pd.DataFrame(np.random.randn(N,3),columns=['A','B','C'])
    
    In [7]: df['c_id'] = np.random.randint(0,10,size=N)
    
    In [8]: df['f_id'] = np.random.randint(0,10,size=N)
    
    In [9]: df.index = pd.date_range('20130101',periods=N,freq='s')
    
    In [10]: df.to_hdf('test2.h5','df',mode='w',data_columns=['c_id','f_id'])
    
    In [11]: df.head(16)
    Out[11]: 
                                A         B         C  c_id  f_id
    2013-01-01 00:00:00  0.037287  1.153534  0.639669     8     7
    2013-01-01 00:00:01  1.741046  0.459821  0.194282     8     3
    2013-01-01 00:00:02 -2.273919 -0.141789  0.770567     1     1
    2013-01-01 00:00:03  0.320879 -0.108426 -1.310302     8     6
    2013-01-01 00:00:04 -1.445810 -0.777090 -0.148362     5     5
    2013-01-01 00:00:05  1.608211  0.069196  0.025021     3     6
    2013-01-01 00:00:06 -0.561690  0.613579  1.071438     8     2
    2013-01-01 00:00:07  1.795043 -0.661966  1.210714     0     0
    2013-01-01 00:00:08  0.176347 -0.461176  1.624514     3     6
    2013-01-01 00:00:09 -1.084537  1.941610 -1.423559     9     1
    2013-01-01 00:00:10 -0.101036  0.925010 -0.809951     0     9
    2013-01-01 00:00:11 -1.185520  0.968519  2.871983     7     5
    2013-01-01 00:00:12 -1.089267 -0.333969 -0.665014     3     6
    2013-01-01 00:00:13  0.544427  0.130439  0.423749     5     7
    2013-01-01 00:00:14  0.112216  0.404801 -0.061730     5     4
    2013-01-01 00:00:15 -1.349838 -0.639435  0.993495     0     9
    
    
    In [2]: %timeit pd.read_hdf('test2.h5','df',where="f_id=[1] & c_id=[2]")
    1 loops, best of 3: 13.9 s per loop
    
    In [3]: %timeit pd.read_hdf('test2.h5','df',where="f_id=[1,2] & c_id=[1,2]")
    1 loops, best of 3: 21.2 s per loop
    
    In [4]: %timeit pd.read_hdf('test2.h5','df',where="f_id=[1,2,3] & c_id=[1,2,3]")
    1 loops, best of 3: 42.8 s per loop
    

    This particular example is 5 GB uncompressed and 2.9 GB compressed. These results are on the compressed data. In THIS case it is actually quite a bit faster to use the uncompressed file (e.g. the first loop took 3.5 s). This is with 100 million rows.

    So using the last example (In [4]) you are getting 9x the data of the first query (3 f_ids × 3 c_ids = 9 id combinations vs. 1) in a little over 3x the query time (42.8 s vs. 13.9 s).

    However, your speedup should be MUCH larger, because you won't be selecting on individual timestamps in the query; you will do that filtering later, in memory.
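
    A minimal sketch of that in-memory timestamp filtering, assuming you keep a dict that maps each (f_id, c_id) pair to its (start, end) window (the dict name, its contents, and the use of describe() are assumptions for illustration; this version of f would replace the one defined above):

    # hypothetical lookup: (f_id, c_id) -> (start_ts, end_ts)
    time_windows = {(1, 2): (pd.Timestamp('2013-01-01'), pd.Timestamp('2013-01-02'))}

    def f(x):
        # x.name is the (f_id, c_id) group key; keep only rows inside its window
        start, end = time_windows[x.name]
        x = x[(x.index >= start) & (x.index <= end)]
        # then compute whatever per-group stats you need
        return x.describe()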

    This whole approach assumes that you have enough main memory to hold the results at these batch sizes (i.e. each batch query selects a relatively small part of the set).
