pytables writes much faster than h5py. Why?


I noticed that writing .h5 files takes much longer if I use the h5py library instead of the pytables library. What is the reason? This is also true when the shape of the arr…

1 Answer
  • 2021-01-03 00:32

    This is an interesting comparison of PyTables and h5py write performance. Typically I use them to read HDF5 files (usually a few reads of large datasets), so I hadn't noticed this difference. My thoughts align with @max9111's: performance should improve as the number of write operations decreases and the size of each written block increases. To that end, I reworked your code to write N rows of data using fewer loops. (Code is at the end.)
    Results were surprising (to me). Key findings:
    1. Total time to write all of the data was a linear function of the # of loops (for both PyTables and h5py).
    2. The performance gap between PyTables and h5py narrowed only slightly as the size of each write increased.
    3. PyTables was 5.4x faster writing 1 row at a time (1,527,416 writes), and 3.5x faster writing 88 rows at a time (17,357 writes).

    Here is a plot comparing performance. [Chart of the values above.]

    Also, I noticed your code comments say "append columns", but you are extending the first dimension (adding rows to an HDF5 table/dataset). I rewrote your code to test performance extending the second dimension (adding columns to the HDF5 dataset), and saw very similar performance (see the sketch below).
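    For reference, here is a minimal sketch of how the column-extending variant works (my reconstruction with small illustrative sizes and filenames, not the exact code I timed): the PyTables EArray is created with the second dimension enlargeable (the 0 in its shape), and the h5py dataset is resized along axis 1 before each slice assignment.

    import h5py
    import tables
    import numpy as np
    
    # illustrative sizes (my choice, not the benchmark values)
    cdim, block_size, col_loops = 64, 4, 1000
    vals = np.ones((cdim, block_size), dtype="float32")  # block shape: (fixed rows, new columns)
    
    # PyTables: EArray enlargeable along the SECOND dimension (the 0 in shape)
    with tables.open_file("colapp_test_tb.h5", "w") as f:
        a = f.create_earray(f.root, "time_data", atom=tables.Float32Atom(), shape=(cdim, 0))
        for _ in range(col_loops):
            a.append(vals)  # appends along the enlargeable (column) axis
    
    # h5py: resize along axis 1 before each slice assignment
    with h5py.File("colapp_test_h5.h5", "w") as f:
        a = f.create_dataset("time_data", shape=(cdim, 0), maxshape=(cdim, None), dtype="f")
        written = 0
        for _ in range(col_loops):
            a.resize(written + block_size, axis=1)
            a[:, written:written + block_size] = vals
            written += block_size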

    Initially I thought the I/O bottleneck was due to resizing the datasets. So, I rewrote the example to initially size the array to hold all of the rows (sketched below). This did NOT improve performance (and significantly degraded h5py performance). That was very surprising. Not sure what to make of it.
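    The preallocated variant looked roughly like this (again a sketch with small illustrative sizes and filenames, not the exact timed code): the array is created at its final size up front, so each loop only assigns into the next slice and no resize/append calls are needed.

    import h5py
    import tables
    import numpy as np
    
    # illustrative sizes (my choice, not the benchmark values)
    cdim, block_size, row_loops = 64, 4, 1000
    total_rows = block_size * row_loops
    vals = np.ones((block_size, cdim), dtype="float32")
    
    # PyTables: CArray created at its final size; only slice assignments, no appends
    with tables.open_file("prealloc_test_tb.h5", "w") as f:
        a = f.create_carray(f.root, "time_data", atom=tables.Float32Atom(),
                            shape=(total_rows, cdim))
        for i in range(row_loops):
            a[i * block_size:(i + 1) * block_size, :] = vals
    
    # h5py: dataset created at its final size; no resize calls
    with h5py.File("prealloc_test_h5.h5", "w") as f:
        a = f.create_dataset("time_data", shape=(total_rows, cdim), dtype="f")
        for i in range(row_loops):
            a[i * block_size:(i + 1) * block_size, :] = vals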

    Here is my example. It uses three variables to size the array (as data is added):

    • cdim: # of columns (fixed)
    • row_loops: # of write loops
    • block_size: size of data block written on each loop
    • row_loops*block_size = total number of rows written

    I also made a small change to add Ones instead of Zeros (to verify the data was written), and moved the array creation to the top (out of the timing loops).

    My code is here:

    import h5py
    import tables
    import numpy as np
    from time import time
    
    cdim, block_size, row_loops = 64, 4, 381854 
    vals = np.ones((block_size, cdim), dtype="float32")
    
    # append rows
    print("PYTABLES: append rows: %d blocks with: %d rows" % (row_loops, block_size))
    print("=" * 32)
    f = tables.open_file("rowapp_test_tb.h5", "w")
    # EArray with the first dimension enlargeable (the 0 in shape), so rows can be appended
    a = f.create_earray(f.root, "time_data", atom=tables.Float32Atom(), shape=(0, cdim))
    t1 = time()
    for i in range(row_loops):
        a.append(vals)
    tcre = round(time() - t1, 3)
    thcre = round(cdim * block_size * row_loops * 4 / (tcre * 1024 * 1024), 1)
    print("Time to append %d rows: %s sec (%s MB/s)" % (block_size * row_loops, tcre, thcre))
    print("=" * 32)
    chunkshape = a.chunkshape
    f.close()
    
    print("H5PY: append rows %d blocks with: %d rows" % (row_loops, block_size))
    print("=" * 32)
    f = h5py.File(name="rowapp_test_h5.h5", mode='w')
    # resizable dataset: starts with 0 rows, can grow to block_size*row_loops rows;
    # reuse the chunkshape PyTables chose, for a fair comparison
    a = f.create_dataset(name='time_data', shape=(0, cdim),
                         maxshape=(block_size*row_loops, cdim),
                         dtype='f', chunks=chunkshape)
    t1 = time()
    samplesWritten = 0
    for i in range(row_loops):
        a.resize(((i+1)*block_size, cdim))
        a[samplesWritten:samplesWritten+block_size] = vals
        samplesWritten += block_size
    tcre = round(time() - t1, 3)
    thcre = round(cdim * block_size * row_loops * 4 / (tcre * 1024 * 1024), 1)
    print("Time to append %d rows: %s sec (%s MB/s)" % (block_size * row_loops, tcre, thcre))
    print("=" * 32)
    f.close()
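    
    As a quick sanity check that the Ones actually landed on disk, a read-back like this (my addition, not part of the timed code) works for both files, since PyTables writes standard HDF5:

    import h5py
    import numpy as np
    
    # read both files back and confirm every stored value is 1.0
    for fname in ("rowapp_test_tb.h5", "rowapp_test_h5.h5"):
        with h5py.File(fname, "r") as f:
            d = f["time_data"][:]
            print(fname, d.shape, "all ones:", np.all(d == 1.0))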
    