PyTables read random subset

夕颜 2021-02-16 00:27

Is it possible to read a random subset of rows from an HDF5 file (via PyTables or, preferably, pandas)? I have a very large dataset with millions of rows, but only need a sample of few th

1 answer
  •  走了就别回头了
    2021-02-16 00:48

    The pandas documentation covers HDFStore and its compression options.

    Random access via a constructed index is supported as of pandas 0.13:

    In [24]: import numpy as np
    
    In [25]: import pandas as pd
    
    In [26]: df = pd.DataFrame(np.random.randn(100,2),columns=['A','B'])
    
    In [27]: df.to_hdf('test.h5','df',mode='w',format='table')
    
    In [28]: store = pd.HDFStore('test.h5')
    
    In [29]: nrows = store.get_storer('df').nrows
    
    In [30]: nrows
    Out[30]: 100
    
    In [32]: r = np.random.randint(0,nrows,size=10)
    
    In [33]: r
    Out[33]: array([69, 28,  8,  2, 14, 51, 92, 25, 82, 64])
    
    In [34]: pd.read_hdf('test.h5','df',where=pd.Index(r))
    Out[34]: 
               A         B
    69 -0.370739 -0.325433
    28  0.155775  0.961421
    8   0.101041 -0.047499
    2   0.204417  0.470805
    14  0.599348  1.174012
    51  0.634044 -0.769770
    92  0.240077 -0.154110
    25  0.367211 -1.027087
    82 -0.698825 -0.084713
    64 -1.029897 -0.796999
    
    [10 rows x 2 columns]
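
Note that `np.random.randint` draws with replacement, so `r` can contain the same row number more than once. If distinct rows are wanted, one option is to sample without replacement. The sketch below is only an illustration: `sample_row_numbers` and `read_random_rows` are hypothetical helper names, not part of pandas or PyTables, and sorting the row numbers is a choice made here rather than a requirement. The read itself uses the same `get_storer(...).nrows` and `read_hdf(..., where=pd.Index(...))` calls as the session above, and assumes the key was written with `format='table'`.

```python
import random


def sample_row_numbers(nrows, k, seed=None):
    """Return up to k distinct row numbers in [0, nrows), sorted ascending."""
    rng = random.Random(seed)
    k = min(k, nrows)  # cannot draw more distinct rows than exist
    return sorted(rng.sample(range(nrows), k))


def read_random_rows(path, key, k, seed=None):
    """Hypothetical helper: read k random rows from a table-format HDF5 key."""
    import pandas as pd  # imported here so the sampler above has no dependencies

    # Look up the row count without loading the data
    with pd.HDFStore(path, mode="r") as store:
        nrows = store.get_storer(key).nrows
    rows = sample_row_numbers(nrows, k, seed)
    # Same call as in the session above: an integer Index passed as `where`
    return pd.read_hdf(path, key, where=pd.Index(rows))


# The sampler can be checked on its own, without any HDF5 file
print(sample_row_numbers(100, 10, seed=0))
```

With the file from the session above, `read_random_rows('test.h5', 'df', 10)` would return ten distinct rows.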
    

    To include an additional condition, you can do this:

    # make sure that we have indexable columns
    df.to_hdf('test.h5','df',mode='w',format='table',data_columns=True)
    
    # select where the index (an integer index) matches r and A > 0
    In [14]: r
    Out[14]: array([33, 51, 33, 95, 69, 21, 43, 58, 58, 58])
    
    In [13]: pd.read_hdf('test.h5','df',where='index=r & A>0')
    Out[13]: 
               A         B
    21  1.456244  0.173443
    43  0.174464 -0.444029
    
    [2 rows x 2 columns]
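
To make the meaning of `'index=r & A>0'` concrete, here is a rough in-memory counterpart of the same filter on a small, hand-written frame. It is only an illustration of the intended semantics (keep rows whose index label is in `r` and whose `A` is positive); it is not how PyTables evaluates the query on disk, and `rows_matching` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def rows_matching(df, labels, column="A", threshold=0.0):
    """Keep rows whose index label is in `labels` and whose `column` > threshold."""
    in_sample = df.index.isin(labels)  # repeated labels have no extra effect here
    passes = df[column] > threshold
    return df[passes & in_sample]


# Small deterministic frame so the result is easy to check by eye
df = pd.DataFrame({"A": [1.0, -2.0, 3.0, -4.0, 5.0],
                   "B": [0.1, 0.2, 0.3, 0.4, 0.5]})
r = np.array([0, 1, 1, 4])  # note the repeated 1

subset = rows_matching(df, r)
print(subset)
```

For data that fits in memory this is enough; the `where` string is useful when the filter should be applied while reading, so that only matching rows are loaded.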
    
