Problem: I have existing netCDF4 files (about 5000 of them), (typically in shape 96x3712x3712) datapoints (float32). These are files with the first dimension being time (1 f
The influence of chunk size
In the worst case, reading or writing one chunk can be treated as a random read/write operation. The main advantage of an SSD is its speed when reading or writing small chunks of data. An HDD is much slower at this task (a factor of 100 can be observed), and a NAS can be much slower still.
So the solution to the problem is a larger chunk size. Some benchmarks on my system (Core i5-4690):
Example_1 (chunk size (1,29,29) = 3.4 kB):
import numpy as np
import tables #needed for blosc
import h5py as h5
import time
import h5py_cache as h5c

def original_chunk_size():
    File_Name_HDF5='some_Path'
    #Array=np.zeros((1,3712,3712),dtype=np.float32)
    # one time step of random data, matching the d[i:i+1,:,:] slice written below
    Array=np.random.rand(1,3712,3712)
    f = h5.File(File_Name_HDF5, 'a',libver='latest')
    f.swmr_mode = True
    nodays=1
    shape = 96*nodays, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712), dtype='f',
                         chunks=(1,29,29), compression=32001,
                         compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)

    #Writing: one time step per call
    t1=time.time()
    for i in range(0,96*nodays):
        d[i:i+1,:,:]=Array
    f.close()
    print(time.time()-t1)

    #Reading: full time series for each 29x29 tile
    f = h5.File(File_Name_HDF5, 'a',libver='latest')
    f.swmr_mode = True
    d=f['variable']
    for i in range(0,3712,29):
        for j in range(0,3712,29):
            A=np.copy(d[:,i:i+29,j:j+29])
    # elapsed since t1, so this value includes the write time
    print(time.time()-t1)
Results (write/read):
SSD: 38s/54s
HDD: 40s/57s
NAS: 252s/823s
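To see where these numbers come from, the chunk size and chunk count of the two layouts benchmarked in this answer can be estimated with plain arithmetic. This is only a rough sketch: `chunk_stats` is a hypothetical helper (not part of h5py), and it assumes uncompressed float32 data.

```python
import math

def chunk_stats(shape, chunks, itemsize=4):
    """Return (bytes per uncompressed chunk, number of chunks in the dataset)."""
    chunk_bytes = itemsize * math.prod(chunks)
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    return chunk_bytes, n_chunks

shape = (96, 3712, 3712)  # one day of data, float32
for chunks in [(1, 29, 29), (96, 58, 58)]:
    chunk_bytes, n_chunks = chunk_stats(shape, chunks)
    print(chunks, chunk_bytes, "bytes per chunk,", n_chunks, "chunks")
```

With (1,29,29) one day is split into about 1.5 million chunks of 3.4 kB each; with (96,58,58) it is 4096 chunks of 1.3 MB each, so far fewer small random accesses are needed.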
In the second example I use h5py_cache because I want to keep writing slabs of (1,3712,3712). The default chunk cache size is only 1 MB, so it has to be increased to avoid repeated read/write operations on the same chunks. https://pypi.python.org/pypi/h5py-cache/1.0
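As a rough check on how large the cache needs to be: a write of one (1,3712,3712) slab touches every chunk of a (96,58,58) layout, so all of those chunks should fit in the cache until the 96 time steps are complete. The sketch below assumes uncompressed float32 chunks; `required_cache_bytes` and `open_with_cache` are hypothetical helpers. It also assumes a newer h5py (2.9 or later, as far as I know), where `h5py.File` accepts `rdcc_nbytes` directly, so h5py_cache would not be needed there.

```python
import math

def required_cache_bytes(slab_shape, chunks, itemsize=4):
    """Bytes needed to hold every chunk touched by one write of slab_shape."""
    touched = math.prod(math.ceil(s / c) for s, c in zip(slab_shape, chunks))
    return touched * itemsize * math.prod(chunks)

needed = required_cache_bytes((1, 3712, 3712), (96, 58, 58))
print(round(needed / 1024**3, 2), "GiB")

def open_with_cache(path, cache_bytes):
    """Open an HDF5 file with an enlarged chunk cache (assumes h5py >= 2.9)."""
    import h5py  # imported here so the arithmetic above runs without h5py
    return h5py.File(path, 'a', libver='latest', rdcc_nbytes=cache_bytes)
```

The estimate comes out just under 5 GiB, which is consistent with choosing a 6 GB cache.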
Example_2 (chunk size (96,58,58) = 1.3 MB):
import numpy as np
import tables #needed for blosc
import h5py as h5
import time
import h5py_cache as h5c

def modified_chunk_size():
    File_Name_HDF5='some_Path'
    Array=np.random.rand(1,3712,3712)
    f = h5c.File(File_Name_HDF5, 'a',libver='latest',
                 chunk_cache_mem_size=6*1024**3) #6 GB chunk cache
    f.swmr_mode = True
    nodays=1
    shape = 96*nodays, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712), dtype='f',
                         chunks=(96,58,58), compression=32001,
                         compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)

    #Writing: still one time step per call, chunks stay in the cache
    t1=time.time()
    for i in range(0,96*nodays):
        d[i:i+1,:,:]=Array
    f.close()
    print(time.time()-t1)

    #Reading: full time series for each 58x58 tile (exactly one chunk)
    f = h5c.File(File_Name_HDF5, 'a',libver='latest',
                 chunk_cache_mem_size=6*1024**3) #6 GB chunk cache
    f.swmr_mode = True
    d=f['variable']
    for i in range(0,3712,58):
        for j in range(0,3712,58):
            A=np.copy(d[:,i:i+58,j:j+58])
    # elapsed since t1, so this value includes the write time
    print(time.time()-t1)
Results (write/read):
SSD: 10s/16s
HDD: 10s/16s
NAS: 13s/20s
The read/write speed can be improved further by minimizing the number of API calls (reading and writing larger blocks of chunks per call).
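One way to cut down the number of calls is to write several time steps per assignment instead of one. A minimal sketch, assuming the data for a batch fits in memory; `batched_slices` is a hypothetical helper, and the dataset `d` and array `data` in the usage comment are assumed to exist:

```python
def batched_slices(n_steps, batch):
    """Yield slices covering range(n_steps) in blocks of at most `batch` steps."""
    for start in range(0, n_steps, batch):
        yield slice(start, min(start + batch, n_steps))

per_step = list(batched_slices(96, 1))   # one call per time step
per_day = list(batched_slices(96, 96))   # one call for the whole day
print(len(per_step), "calls vs", len(per_day), "call")

# Usage with an h5py dataset `d` and an in-memory array `data` (both assumed):
# for s in batched_slices(data.shape[0], 96):
#     d[s, :, :] = data[s, :, :]
```

The trade-off is memory: a batch of 96 float32 frames of 3712x3712 is about 5.3 GB, so a smaller batch may be the better compromise on machines with less RAM.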
I also want to mention the compression method here. Blosc can achieve up to 1 GB/s throughput (at which point the CPU is the bottleneck); gzip is slower but provides better compression ratios.
d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=3)
20s/30s file size: 101 MB
d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=6)
50s/58s file size: 87 MB
d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=9)
50s/60s file size: 64 MB
And now a benchmark of a whole month (30 days). The writing is slightly optimized: each write call covers a full day, i.e. a block of (96,3712,3712).
import numpy as np
import tables #needed for blosc
import h5py as h5
import time

def modified_chunk_size():
    File_Name_HDF5='some_Path'
    Array_R=np.random.rand(1,3712,3712)
    # one full day: the same random frame repeated for all 96 time steps
    Array=np.zeros((96,3712,3712),dtype=np.float32)
    for j in range(0,96):
        Array[j,:,:]=Array_R
    f = h5.File(File_Name_HDF5, 'a',libver='latest')
    f.swmr_mode = True
    nodays=30
    shape = 96, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None,3712,3712), dtype='f',
                         chunks=(96,58,58), compression=32001,
                         compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)

    #Writing: one full day per call
    t1=time.time()
    for i in range(0,96*nodays,96):
        # grow the time axis to fit this day before writing it
        d.resize(i+96, axis=0)
        d[i:i+96,:,:]=Array
    f.close()
    print(time.time()-t1)

    #Reading: full time series for each 58x58 tile
    f = h5.File(File_Name_HDF5, 'a',libver='latest')
    f.swmr_mode = True
    d=f['variable']
    for i in range(0,3712,58):
        for j in range(0,3712,58):
            A=np.copy(d[:,i:i+58,j:j+58])
    # elapsed since t1, so this value includes the write time
    print(time.time()-t1)
133s/301s with blosc
432s/684s with gzip compression_opts=3
I had the same problems when accessing data on a NAS. I hope this helps...