h5py not sticking to chunking specification?

梦毁少年i 2020-12-04 03:13

Problem: I have existing netCDF4 files (about 5000 of them), typically of shape 96x3712x3712 datapoints (float32). These are files with the first dimension being time (1 f

1 Answer
  • 2020-12-04 03:35

    The influence of chunk size

    In a worst-case scenario, reading or writing one chunk can be considered a random read/write operation. The main advantage of an SSD is the speed of reading or writing small chunks of data. An HDD is much slower at this task (a factor of 100 can be observed), and a NAS can be even slower than an HDD.

    So the solution to the problem is a larger chunk size. Some benchmarks on my system (Core i5-4690).
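    To put numbers on the chunk sizes used in the benchmarks below: the uncompressed size of one chunk is just the product of the chunk shape and the itemsize of float32 (4 bytes). A quick arithmetic sketch:

    import numpy as np

    itemsize = np.dtype('f4').itemsize           # float32 -> 4 bytes per value

    small_chunk = (1, 29, 29)                    # Example 1
    large_chunk = (96, 58, 58)                   # Example 2

    print(np.prod(small_chunk) * itemsize / 1e3)   # ~3.4 kB per chunk
    print(np.prod(large_chunk) * itemsize / 1e6)   # ~1.3 MB per chunk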

    Example 1 (chunk size (1,29,29) = 3.4 kB):

    import numpy as np
    import tables               # needed to register the blosc filter (compression=32001)
    import h5py as h5
    import time
    import h5py_cache as h5c    # only used in Example 2

    def original_chunk_size():
        File_Name_HDF5='some_Path'
        Array=np.random.rand(1,3712,3712)   # one time step per write call

        f = h5.File(File_Name_HDF5, 'a', libver='latest')
        nodays=1

        shape = 96*nodays, 3712, 3712
        d = f.create_dataset('variable', shape, maxshape=(None,3712,3712), dtype='f',
                             chunks=(1,29,29), compression=32001,
                             compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
        f.swmr_mode = True   # SWMR writing has to be enabled after all datasets are created

        #Writing
        t1=time.time()
        for i in range(0,96*nodays):
            d[i:i+1,:,:]=Array

        f.close()
        print(time.time()-t1)

        #Reading
        f = h5.File(File_Name_HDF5, 'a', libver='latest')
        f.swmr_mode = True
        d=f['variable']

        for i in range(0,3712,29):
            for j in range(0,3712,29):
                A=np.copy(d[:,i:i+29,j:j+29])

        print(time.time()-t1)
        f.close()
    

    Results (write/read):

    SSD: 38s/54s

    HDD: 40s/57s

    NAS: 252s/823s

    In the second example I will use h5py_cache, because I want to keep writing (1,3712,3712) slices. The default chunk cache size is only one MB, so it has to be increased to avoid multiple read/write operations per chunk. https://pypi.python.org/pypi/h5py-cache/1.0
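    (Side note: newer h5py releases, 2.9 and up, can set the raw chunk cache directly through the rdcc_* keyword arguments of h5py.File, so h5py_cache is no longer strictly necessary. A minimal sketch, assuming h5py >= 2.9:)

    import h5py as h5

    # Raise the chunk cache from the 1 MB default to 6 GB; rdcc_nslots should be
    # a prime number well above the number of chunks that fit into the cache.
    f = h5.File('some_Path', 'a', libver='latest',
                rdcc_nbytes=6*1024**3, rdcc_w0=1.0, rdcc_nslots=1000003)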

    Example 2 (chunk size (96,58,58) = 1.3 MB):

    import numpy as np
    import tables               # needed to register the blosc filter (compression=32001)
    import h5py as h5
    import time
    import h5py_cache as h5c

    def modified_chunk_size():
        File_Name_HDF5='some_Path'
        Array=np.random.rand(1,3712,3712)

        f = h5c.File(File_Name_HDF5, 'a', libver='latest',
                     chunk_cache_mem_size=6*1024**3)   # 6 GB chunk cache
        nodays=1

        shape = 96*nodays, 3712, 3712
        d = f.create_dataset('variable', shape, maxshape=(None,3712,3712), dtype='f',
                             chunks=(96,58,58), compression=32001,
                             compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
        f.swmr_mode = True   # SWMR writing has to be enabled after all datasets are created

        #Writing
        t1=time.time()
        for i in range(0,96*nodays):
            d[i:i+1,:,:]=Array

        f.close()
        print(time.time()-t1)

        #Reading
        f = h5c.File(File_Name_HDF5, 'a', libver='latest',
                     chunk_cache_mem_size=6*1024**3)   # 6 GB chunk cache
        f.swmr_mode = True
        d=f['variable']

        for i in range(0,3712,58):
            for j in range(0,3712,58):
                A=np.copy(d[:,i:i+58,j:j+58])

        print(time.time()-t1)
        f.close()
    

    Results (write/read):

    SSD: 10s/16s

    HDD: 10s/16s

    NAS: 13s/20s

    The read/write speed can be improved further by minimizing the API calls (reading and writing larger blocks of chunks per call).
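    For example, instead of requesting one 58x58 tile per call, several chunk-aligned tiles can be fetched with a single slicing operation (a sketch only; the factor of 10 is an arbitrary choice, and d is the dataset opened in Example 2):

    import numpy as np

    step = 58*10   # still a multiple of the chunk edge length (58)
    for i in range(0, 3712, step):
        for j in range(0, 3712, step):
            # one call now covers 10x10 chunks instead of a single chunk column
            A = np.copy(d[:, i:i+step, j:j+step])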

    I also want to mention the compression method here. Blosc can achieve up to 1 GB/s throughput (it is CPU bound); gzip is slower, but provides better compression ratios.

    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=3)
    

    20s/30s file size: 101 MB

    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=6)

    50s/58s file size: 87 MB

    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=9)

    50s/60s file size: 64 MB
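    (If the opaque compression=32001 / compression_opts tuple is hard to read: the hdf5plugin package, which is not used in this answer, registers the same Blosc filter and accepts named parameters instead of the raw cd_values tuple. A rough sketch; the compressor, level and shuffle choices are illustrative, and f and shape are reused from the snippets above:)

    import hdf5plugin   # pip install hdf5plugin

    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712), dtype='f',
                         chunks=(96,58,58),
                         **hdf5plugin.Blosc(cname='lz4', clevel=9,
                                            shuffle=hdf5plugin.Blosc.SHUFFLE))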

    And now a benchmark of a whole month (30 days). The writing is slightly optimized: data is written in blocks of (96,3712,3712).

    import numpy as np
    import tables               # needed to register the blosc filter (compression=32001)
    import h5py as h5
    import time

    def modified_chunk_size():
        File_Name_HDF5='some_Path'

        Array_R=np.random.rand(1,3712,3712)
        Array=np.zeros((96,3712,3712),dtype=np.float32)
        for j in range(0,96):
            Array[j,:,:]=Array_R

        f = h5.File(File_Name_HDF5, 'a', libver='latest')
        nodays=30

        shape = 96, 3712, 3712
        d = f.create_dataset('variable', shape, maxshape=(None,3712,3712), dtype='f',
                             chunks=(96,58,58), compression=32001,
                             compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
        f.swmr_mode = True   # SWMR writing has to be enabled after all datasets are created

        #Writing (one day = 96 time steps per call)
        t1=time.time()
        for i in range(0,96*nodays,96):
            if i > 0:
                d.resize((i+96,shape[1],shape[2]))   # grow by one day before writing it
            d[i:i+96,:,:]=Array

        f.close()
        print(time.time()-t1)

        #Reading
        f = h5.File(File_Name_HDF5, 'a', libver='latest')
        f.swmr_mode = True
        d=f['variable']
        for i in range(0,3712,58):
            for j in range(0,3712,58):
                A=np.copy(d[:,i:i+58,j:j+58])

        print(time.time()-t1)
        f.close()
    

    133s/301s with blosc

    432s/684s with gzip compression_opts=3

    I had the same problems when accessing data on a NAS. I hope this helps...
