h5py not sticking to chunking specification?

梦毁少年i 2020-12-04 03:13

Problem: I have existing netCDF4 files (about 5000 of them), typically of shape 96x3712x3712 datapoints (float32). These are files with the first dimension being time (1 f

1 Answer
  • 2020-12-04 03:35

    The influence of chunk size

    In a worst-case scenario, reading or writing one chunk can be considered a random read/write operation. The main advantage of an SSD is the speed of reading or writing small chunks of data. An HDD is much slower at this task (a factor of 100 can be observed), and a NAS can be even slower than an HDD.

    So the solution to the problem is a larger chunk size. Some benchmarks on my system (Core i5-4690).
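    To put numbers on the chunk sizes used in the benchmarks below: the uncompressed size of one chunk is just the product of the chunk shape and the itemsize of float32 (4 bytes). A quick arithmetic sketch:

    import numpy as np

    itemsize = np.dtype('f4').itemsize           # float32 -> 4 bytes per value

    small_chunk = (1, 29, 29)                    # Example 1
    large_chunk = (96, 58, 58)                   # Example 2

    print(np.prod(small_chunk) * itemsize / 1e3)   # ~3.4 kB per chunk
    print(np.prod(large_chunk) * itemsize / 1e6)   # ~1.3 MB per chunk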

    Example 1 (chunk size (1,29,29) = 3.4 kB):

    import numpy as np
    import tables               # needed to register the blosc filter (compression=32001)
    import h5py as h5
    import time
    import h5py_cache as h5c    # only used in Example 2

    def original_chunk_size():
        File_Name_HDF5='some_Path'
        Array=np.random.rand(1,3712,3712)   # one time step per write call

        f = h5.File(File_Name_HDF5, 'a', libver='latest')
        nodays=1

        shape = 96*nodays, 3712, 3712
        d = f.create_dataset('variable', shape, maxshape=(None,3712,3712), dtype='f',
                             chunks=(1,29,29), compression=32001,
                             compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
        f.swmr_mode = True   # SWMR writing has to be enabled after all datasets are created

        #Writing
        t1=time.time()
        for i in range(0,96*nodays):
            d[i:i+1,:,:]=Array

        f.close()
        print(time.time()-t1)

        #Reading
        f = h5.File(File_Name_HDF5, 'a', libver='latest')
        f.swmr_mode = True
        d=f['variable']

        for i in range(0,3712,29):
            for j in range(0,3712,29):
                A=np.copy(d[:,i:i+29,j:j+29])

        print(time.time()-t1)
        f.close()
    

    Results (write/read):

    SSD: 38s/54s

    HDD: 40s/57s

    NAS: 252s/823s

    In the second example I will use h5py_cache, because I want to keep writing (1,3712,3712) slices. The default chunk cache size is only one MB, so it has to be increased to avoid multiple read/write operations per chunk. https://pypi.python.org/pypi/h5py-cache/1.0
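    (Side note: newer h5py releases, 2.9 and up, can set the raw chunk cache directly through the rdcc_* keyword arguments of h5py.File, so h5py_cache is no longer strictly necessary. A minimal sketch, assuming h5py >= 2.9:)

    import h5py as h5

    # Raise the chunk cache from the 1 MB default to 6 GB; rdcc_nslots should be
    # a prime number well above the number of chunks that fit into the cache.
    f = h5.File('some_Path', 'a', libver='latest',
                rdcc_nbytes=6*1024**3, rdcc_w0=1.0, rdcc_nslots=1000003)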

    Example 2 (chunk size (96,58,58) = 1.3 MB):

    import numpy as np
    import tables               # needed to register the blosc filter (compression=32001)
    import h5py as h5
    import time
    import h5py_cache as h5c

    def modified_chunk_size():
        File_Name_HDF5='some_Path'
        Array=np.random.rand(1,3712,3712)

        f = h5c.File(File_Name_HDF5, 'a', libver='latest',
                     chunk_cache_mem_size=6*1024**3)   # 6 GB chunk cache
        nodays=1

        shape = 96*nodays, 3712, 3712
        d = f.create_dataset('variable', shape, maxshape=(None,3712,3712), dtype='f',
                             chunks=(96,58,58), compression=32001,
                             compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
        f.swmr_mode = True   # SWMR writing has to be enabled after all datasets are created

        #Writing
        t1=time.time()
        for i in range(0,96*nodays):
            d[i:i+1,:,:]=Array

        f.close()
        print(time.time()-t1)

        #Reading
        f = h5c.File(File_Name_HDF5, 'a', libver='latest',
                     chunk_cache_mem_size=6*1024**3)   # 6 GB chunk cache
        f.swmr_mode = True
        d=f['variable']

        for i in range(0,3712,58):
            for j in range(0,3712,58):
                A=np.copy(d[:,i:i+58,j:j+58])

        print(time.time()-t1)
        f.close()
    

    Results (write/read):

    SSD: 10s/16s

    HDD: 10s/16s

    NAS: 13s/20s

    The read/write speed can be improved further by minimizing the API calls (reading and writing larger blocks of chunks per call).
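    For example, instead of requesting one 58x58 tile per call, several chunk-aligned tiles can be fetched with a single slicing operation (a sketch only; the factor of 10 is an arbitrary choice, and d is the dataset opened in Example 2):

    import numpy as np

    step = 58*10   # still a multiple of the chunk edge length (58)
    for i in range(0, 3712, step):
        for j in range(0, 3712, step):
            # one call now covers 10x10 chunks instead of a single chunk column
            A = np.copy(d[:, i:i+step, j:j+step])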

    I also want to mention the compression method here. Blosc can achieve up to 1 GB/s throughput (it is CPU bound); gzip is slower, but provides better compression ratios.

    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=3)
    

    20s/30s file size: 101 MB

    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=6)

    50s/58s file size: 87 MB

    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=9)

    50s/60s file size: 64 MB
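    (If the opaque compression=32001 / compression_opts tuple is hard to read: the hdf5plugin package, which is not used in this answer, registers the same Blosc filter and accepts named parameters instead of the raw cd_values tuple. A rough sketch; the compressor, level and shuffle choices are illustrative, and f and shape are reused from the snippets above:)

    import hdf5plugin   # pip install hdf5plugin

    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712), dtype='f',
                         chunks=(96,58,58),
                         **hdf5plugin.Blosc(cname='lz4', clevel=9,
                                            shuffle=hdf5plugin.Blosc.SHUFFLE))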

    And now a benchmark of a whole month (30 days). The writing is slightly optimized: data is written in blocks of (96,3712,3712).

    import numpy as np
    import tables               # needed to register the blosc filter (compression=32001)
    import h5py as h5
    import time

    def modified_chunk_size():
        File_Name_HDF5='some_Path'

        Array_R=np.random.rand(1,3712,3712)
        Array=np.zeros((96,3712,3712),dtype=np.float32)
        for j in range(0,96):
            Array[j,:,:]=Array_R

        f = h5.File(File_Name_HDF5, 'a', libver='latest')
        nodays=30

        shape = 96, 3712, 3712
        d = f.create_dataset('variable', shape, maxshape=(None,3712,3712), dtype='f',
                             chunks=(96,58,58), compression=32001,
                             compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
        f.swmr_mode = True   # SWMR writing has to be enabled after all datasets are created

        #Writing (one day = 96 time steps per call)
        t1=time.time()
        for i in range(0,96*nodays,96):
            if i > 0:
                d.resize((i+96,shape[1],shape[2]))   # grow by one day before writing it
            d[i:i+96,:,:]=Array

        f.close()
        print(time.time()-t1)

        #Reading
        f = h5.File(File_Name_HDF5, 'a', libver='latest')
        f.swmr_mode = True
        d=f['variable']
        for i in range(0,3712,58):
            for j in range(0,3712,58):
                A=np.copy(d[:,i:i+58,j:j+58])

        print(time.time()-t1)
        f.close()
    

    133s/301s with blosc

    432s/684s with gzip compression_opts=3

    I had the same problems when accessing data on a NAS. I hope this helps...
