Query HDF5 in Pandas


I have the following data (18,619,211 rows) stored as a pandas DataFrame in an HDF5 file:

             date    id2         w
id
100           ...

(the rest of the sample was truncated in the original post)

1 Answer

    Here are the docs for querying on non-index columns.

    First, create the test data. It is not clear how the original frame is constructed, e.g. whether its data is unique and what the ranges look like, so I have created a sample with 10M rows and a multi-level index of id and date.

    In [58]: import numpy as np

    In [59]: import pandas as pd

    In [60]: np.random.seed(1234)
    
    In [62]: pd.set_option('display.max_rows',20)
    
    In [63]: index = pd.MultiIndex.from_product([np.arange(10000,11000),pd.date_range('19800101',periods=10000)],names=['id','date'])
    
    In [67]: df = pd.DataFrame(dict(id2=np.random.randint(0,1000,size=len(index)),w=np.random.randn(len(index))),index=index).reset_index().set_index(['id','date'])
    
    In [68]: df
    Out[68]: 
                      id2         w
    id    date                     
    10000 1980-01-01  712  0.371372
          1980-01-02  718 -1.255708
          1980-01-03  581 -1.182727
          1980-01-04  202 -0.947432
          1980-01-05  493 -0.125346
          1980-01-06  752  0.380210
          1980-01-07  435 -0.444139
          1980-01-08  128 -1.885230
          1980-01-09  425  1.603619
          1980-01-10  449  0.103737
    ...               ...       ...
    10999 2007-05-09    8  0.624532
          2007-05-10  669  0.268340
          2007-05-11  918  0.134816
          2007-05-12  979 -0.769406
          2007-05-13  969 -0.242123
          2007-05-14  950 -0.347884
          2007-05-15   49 -1.284825
          2007-05-16  922 -1.313928
          2007-05-17  347 -0.521352
          2007-05-18  353  0.189717
    
    [10000000 rows x 2 columns]
    

    Write the data to disk, showing how to create a data column (the indexes are automatically queryable, and declaring data_columns allows id2 to be queryable as well). to_hdf takes care of opening and closing the store; it is de facto equivalent to opening a store, appending, and closing it yourself, as sketched after the listing below.

    In order to query a column, it MUST BE A DATA COLUMN or an index of the frame.

    In [70]: df.to_hdf('test.h5','df',mode='w',data_columns=['id2'],format='table')
    
    In [71]: !ls -ltr test.h5
    -rw-rw-r-- 1 jreback users 430540284 May 26 17:16 test.h5
    
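    For reference, here is a minimal sketch of that explicit-store equivalent (the get_storer inspection at the end is an extra illustration, not part of the original answer):

    # open the store explicitly, append the frame as a queryable table, then close
    store = pd.HDFStore('test.h5', mode='w')
    store.append('df', df, data_columns=['id2'], format='table')

    # index levels (id, date) are always queryable; data_columns lists the
    # extra columns declared as queryable
    print(store.get_storer('df').data_columns)  # expected: ['id2']
    store.close()
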

    Queries

    In [80]: ids=[10101,10898]
    
    In [81]: start_date='20010101'
    
    In [82]: end_date='20010301'
    

    You can specify dates as strings (either in-line or as variables); you can also pass Timestamp-like objects, as sketched after the output below.

    In [83]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date')
    Out[83]: 
                      id2         w
    id    date                     
    10000 2001-01-02  972 -0.146107
          2001-01-03  954  1.420412
          2001-01-04  567  1.077633
          2001-01-05   87 -0.042838
          2001-01-06   79 -1.791228
          2001-01-07  744  1.110478
          2001-01-08  237 -0.846086
          2001-01-09  998 -0.696369
          2001-01-10  266 -0.595555
          2001-01-11  206 -0.294633
    ...               ...       ...
    10999 2001-02-19  616 -0.745068
          2001-02-20  577 -1.474748
          2001-02-21  990 -1.276891
          2001-02-22  939 -1.369558
          2001-02-23  621 -0.214365
          2001-02-24  396 -0.142100
          2001-02-25  492 -0.204930
          2001-02-26  478  1.839291
          2001-02-27  688  0.291504
          2001-02-28  356 -1.987554
    
    [58000 rows x 2 columns]
    
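    A minimal sketch of the Timestamp variant (ts_start and ts_end are illustrative names): variables referenced in a where string are resolved from the local scope, just like the string dates above.

    # Timestamp objects work in a where clause the same way as date strings
    ts_start = pd.Timestamp('2001-01-01')
    ts_end = pd.Timestamp('2001-03-01')
    result = pd.read_hdf('test.h5', 'df', where='date>ts_start & date<ts_end')
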

    You can use in-line lists

    In [84]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date & id=ids')
    Out[84]: 
                      id2         w
    id    date                     
    10101 2001-01-02  722  1.620553
          2001-01-03  849 -0.603468
          2001-01-04  635 -1.419072
          2001-01-05  331  0.521634
          2001-01-06  730  0.008830
          2001-01-07  706 -1.006412
          2001-01-08   59  1.380005
          2001-01-09  689  0.017830
          2001-01-10  788 -3.090800
          2001-01-11  704 -1.491824
    ...               ...       ...
    10898 2001-02-19  530 -1.031167
          2001-02-20  652 -0.019266
          2001-02-21  472  0.638266
          2001-02-22  540 -1.827251
          2001-02-23  654 -1.020140
          2001-02-24  328 -0.477425
          2001-02-25  871 -0.892684
          2001-02-26  166  0.894118
          2001-02-27  806  0.648240
          2001-02-28  824 -1.051539
    
    [116 rows x 2 columns]
    

    You can also specify boolean expressions

    In [85]: pd.read_hdf('test.h5','df',where='date>start_date & date<end_date & id=ids & id2>500 & id2<600')
    Out[85]: 
                      id2         w
    id    date                     
    10101 2001-01-12  534 -0.220692
          2001-01-14  596 -2.225393
          2001-01-16  596  0.956239
          2001-01-30  513 -2.528996
          2001-02-01  572 -1.877398
          2001-02-13  569 -0.940748
          2001-02-14  541  1.035619
          2001-02-21  571 -0.116547
    10898 2001-01-16  591  0.082564
          2001-02-06  586  0.470872
          2001-02-10  531 -0.536194
          2001-02-16  586  0.949947
          2001-02-19  530 -1.031167
          2001-02-22  540 -1.827251
    

    To answer your actual question, I would do this (there is really not enough information, but I'll set out some reasonable expectations):

    • Don't loop over queries, unless you have a very small number of absolute queries.
    • Read the biggest chunk into memory that you can. Usually this is accomplished by selecting out the biggest ranges of data that you need, even if you select MORE data than you actually need.
    • Then subselect using in-memory expressions, which will generally be orders of magnitude faster.
    • List elements are limited to about 30 elements in total (this is currently an implementation limit on the PyTables side). Specifying more will work, but you will read in a lot of data and it will then be reindexed down in memory, so you need to be aware of this.

    So, for example, say you have 1,000 unique ids with 10,000 dates each, as my example demonstrates, and you want to select say 200 of those ids over a date range of 1,000 days.

    So in this case I would simply select on the dates then do the in-memory comparison, something like this:

    # select the superset on dates, then subselect the ids in memory;
    # id is an index level, so use get_level_values for the membership test
    df = pd.read_hdf('test.h5','df',where='date>=global_start_date & date<=global_end_date')
    df = df[df.index.get_level_values('id').isin(list_of_ids)]
    

    You also might have dates that change per id, so chunk the queries, this time using a list of ids.

    Something like this:

    output = []
    # walk the id list in chunks of 30 (the PyTables selector limit noted above)
    for i in range(0, len(list_of_ids), 30):
        ids = list_of_ids[i:i + 30]
        # placeholder helpers: compute the overall date bounds for this chunk
        start_date = get_start_date_for_these_ids(ids)
        end_date = get_end_date_for_these_ids(ids)
        where = 'id=ids & date>=start_date & date<=end_date'
        output.append(pd.read_hdf('test.h5', 'df', where=where))

    final_result = pd.concat(output)
    

    The basic idea, then, is to select a superset of the data using the criteria you want, sub-select in memory so the result fits, and limit the number of queries you issue (e.g. imagine a query that ends up selecting a single row; if you have to run it 18M times, that is bad).
