Question
Problem: I have about 5000 existing netCDF4 files, each typically of shape 96x3712x3712 data points (float32). These files have time as the first dimension (1 file per day) and the spatial dimensions as the second and third. Currently, making a slice over the first dimension (even a partial slice) takes a lot of time, for the following reasons:
- the netCDF files are chunked with a chunk size of 1x3712x3712. Slicing over the time dimension basically reads the entire file.
- looping (even in multiple processes) over all of the smaller files takes a lot of time as well (a minimal sketch of this access pattern follows this list).
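For illustration, this is roughly the access pattern that is slow today: pulling a single-pixel time series across the daily files (the glob pattern and the pixel indices are just placeholders; the variable name matches the code further below):
from glob import glob

import numpy as np
from netCDF4 import Dataset

def pixel_timeseries(pattern, y, x):
    """Collect the value at pixel (y, x) from every daily file; every read
    has to decompress full 1x3712x3712 chunks, hence the poor performance."""
    values = []
    for fp in sorted(glob(pattern)):
        with Dataset(fp) as nc:
            values.append(np.asarray(nc.variables['variable'][:, y, x]))
    return np.concatenate(values)

series = pixel_timeseries('/data/daily/*.nc', 1000, 2000)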
My goal:
- create monthly files (about 2900x3712x3712 data points)
- optimize them for slicing in the time dimension (chunk size of 2900x1x1, or slightly bigger in the spatial dimensions)
Other requirements:
- the files should be appendable with a single timestamp (1x3712x3712), and this update process should take less than 15 min
- querying should be fast enough: a full slice over time (that is 2900x1x1) in less than one second ==> not that much data, in fact...
- preferably, the files should be accessible for reading by multiple processes while being updated
- processing the historical data (the other 5000 daily files) should preferably take less than a couple of weeks.
I have already tried multiple approaches:
- concatenating the netCDF files and rechunking them ==> takes too much memory and too much time...
- writing them from pandas to an HDF file (using PyTables) ==> creates a wide table with a huge index. This will eventually take too much time to read as well, and requires the dataset to be tiled over the spatial dimensions due to metadata constraints.
- my last approach was writing them to an HDF5 file using h5py:
Here's the code to create a single monthly file:
import os
import logging

import h5py
import numpy as np
import pandas as pd
from netCDF4 import Dataset, num2date

logger = logging.getLogger(__name__)

def create_h5(fps):
    # Reference time axis: one month of 15-minute timestamps (31 days * 96 slots)
    timestamps = pd.date_range("20050101", periods=31*96, freq='15T')
    output_fp = r'/data/test.h5'
    nodays = 31
    shape = 96*nodays, 3712, 3712
    try:
        f = h5py.File(output_fp, 'a', libver='latest')
        # Start with a single time step; grow along the (unlimited) time axis as data arrives
        d = f.create_dataset('variable', shape=(1, 3712, 3712),
                             maxshape=(None, 3712, 3712), dtype='f',
                             compression='gzip', compression_opts=9,
                             chunks=(1, 29, 29))
        f.swmr_mode = True
        for fp in fps:
            try:
                nc = Dataset(fp)
                times = num2date(nc.variables['time'][:], nc.variables['time'].units)
                # Map each timestamp of the daily file to its position on the monthly time axis
                indices = np.searchsorted(timestamps, times)
                for j, time in enumerate(times):
                    logger.debug("File: {}, timestamp: {:%Y%m%d %H:%M}, pos: {}, new_pos: {}".format(
                        os.path.basename(fp), time, j, indices[j]))
                    d.resize((indices[j]+1, shape[1], shape[2]))
                    d[indices[j]] = nc.variables['variable'][j:j+1]
                    f.flush()
            finally:
                nc.close()
    finally:
        f.close()
    return output_fp
I'm using the latest version of HDF5 in order to have the SWMR option. The fps argument is a list of file paths of the daily netCDF4 files. It creates the file (on an SSD, but I see that creating the file is mainly CPU bound) in about 2 hours, which is acceptable.
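Since the file is written with SWMR enabled, a separate process should in principle be able to follow the growing dataset while it is being updated (one of the requirements above). A minimal reader sketch under that assumption, using the path and dataset name from the code above and an arbitrary polling interval:
import time
import h5py

with h5py.File('/data/test.h5', 'r', libver='latest', swmr=True) as f:
    d = f['variable']
    while True:
        d.refresh()   # pick up the writer's latest resize/flush
        print("time steps currently visible:", d.shape[0])
        time.sleep(60)   # arbitrary polling interval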
I have compression enabled to keep the file size within limits. I did earlier tests without it, and saw that creation is a bit faster without compression, but slicing doesn't take much longer with it. H5py automatically chunks the dataset into 1x116x116 chunks.
Now the problem: slicing along the time dimension on a NAS with a RAID 6 setup takes about 20 seconds, even though the data is in a single chunk...
I figure that even though it is in a single chunk in the file, it must be fragmented somehow because I wrote all of the values in a loop (I don't know how this process works, though). This is why I tried an h5repack, using the HDF5 command-line tools, into a new file with the same chunks, hoping to reorder the values so that the query could read them more sequentially, but no luck. Even though this process took 6 hours to run, it didn't do a thing for the query speed.
If I do my calculations right, one chunk (2976x32x32) is only a few MB (11 MB uncompressed, and I think only a bit more than 1 MB compressed). How can reading it take so long? What am I doing wrong? I would be glad if someone could shine a light on what is actually going on behind the scenes...
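One way to check whether the chunks really are scattered on disk after the h5repack is to inspect the chunk index directly. This is a sketch that assumes a recent h5py built against HDF5 >= 1.10.5 (an assumption on my part), and uses the path and dataset name from the code above:
import h5py

with h5py.File('/data/test.h5', 'r') as f:
    dsid = f['variable'].id
    n = dsid.get_num_chunks()
    print("number of stored chunks:", n)
    # Print the logical position, file offset and on-disk size of the first
    # few chunks; widely scattered byte offsets indicate a fragmented layout.
    for i in range(min(n, 10)):
        info = dsid.get_chunk_info(i)
        print(info.chunk_offset, info.byte_offset, info.size)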
Answer 1:
The influence of chunk size
In a worst-case scenario, reading and writing one chunk can be considered a random read/write operation. The main advantage of an SSD is the speed of reading or writing small chunks of data. An HDD is much slower at this task (a factor of 100 can be observed), and a NAS can be even slower than an HDD.
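To put numbers on that for this dataset, a quick back-of-the-envelope sketch (assuming one month of 2976 time steps and a query that pulls the full time series of a single pixel, which lies in exactly one chunk column):
import math

nt = 2976   # time steps in one month of 15-minute data

def chunks_per_time_slice(time_chunk):
    """Chunk reads needed to pull the full time series of a single pixel:
    one chunk column in space, stacked along the time axis."""
    return math.ceil(nt / time_chunk)

print(chunks_per_time_slice(1))    # chunks of (1,29,29)  -> 2976 random ~3.4 kB reads
print(chunks_per_time_slice(96))   # chunks of (96,58,58) ->   31 larger ~1.3 MB reads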
So the solution to the problem is a larger chunk size. Some benchmarks on my system (Core i5-4690):
Example_1 (chunk size (1,29,29) = 3.4 kB):
import time

import numpy as np
import tables               # imported to register the blosc filter (ID 32001)
import h5py as h5

def original_chunk_size():
    File_Name_HDF5 = 'some_Path'
    Array = np.random.rand(1, 3712, 3712).astype(np.float32)   # one time step
    f = h5.File(File_Name_HDF5, 'a', libver='latest')
    nodays = 1
    shape = 96*nodays, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None, 3712, 3712), dtype='f',
                         chunks=(1, 29, 29),
                         compression=32001, compression_opts=(0, 0, 0, 0, 9, 1, 1),
                         shuffle=False)
    f.swmr_mode = True
    # Writing: one time step at a time
    t1 = time.time()
    for i in range(0, 96*nodays):
        d[i:i+1, :, :] = Array
    f.close()
    print(time.time()-t1)
    # Reading: one chunk column at a time
    t1 = time.time()
    f = h5.File(File_Name_HDF5, 'a', libver='latest')
    f.swmr_mode = True
    d = f['variable']
    for i in range(0, 3712, 29):
        for j in range(0, 3712, 29):
            A = np.copy(d[:, i:i+29, j:j+29])
    print(time.time()-t1)
Results (write/read):
SSD: 38s/54s
HDD: 40s/57s
NAS: 252s/823s
In the second example I will use h5py_cache, because I want to keep providing chunks of (1,3712,3712). The default chunk cache size is only one MB, so it has to be increased to avoid multiple read/write operations per chunk. https://pypi.python.org/pypi/h5py-cache/1.0
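(A side note and an assumption on my part: h5py 2.9 and later accept the chunk cache size directly as File keyword arguments, so the same effect can be had without h5py_cache:)
import h5py as h5

# Same 6 GB chunk cache, configured through plain h5py keyword arguments
f = h5.File('some_Path', 'a', libver='latest',
            rdcc_nbytes=6*1024**3, rdcc_nslots=1000003)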
Example_2 (chunk size (96,58,58) = 1.3 MB):
import time

import numpy as np
import tables               # imported to register the blosc filter (ID 32001)
import h5py as h5
import h5py_cache as h5c

def modified_chunk_size():
    File_Name_HDF5 = 'some_Path'
    Array = np.random.rand(1, 3712, 3712).astype(np.float32)   # one time step
    # 6 GB chunk cache, so whole (96,58,58) chunks stay cached while they are
    # filled one time step at a time
    f = h5c.File(File_Name_HDF5, 'a', libver='latest',
                 chunk_cache_mem_size=6*1024**3)
    nodays = 1
    shape = 96*nodays, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None, 3712, 3712), dtype='f',
                         chunks=(96, 58, 58),
                         compression=32001, compression_opts=(0, 0, 0, 0, 9, 1, 1),
                         shuffle=False)
    f.swmr_mode = True
    # Writing: still one time step at a time
    t1 = time.time()
    for i in range(0, 96*nodays):
        d[i:i+1, :, :] = Array
    f.close()
    print(time.time()-t1)
    # Reading: one chunk column at a time
    t1 = time.time()
    f = h5c.File(File_Name_HDF5, 'a', libver='latest',
                 chunk_cache_mem_size=6*1024**3)   # 6 GB chunk cache
    f.swmr_mode = True
    d = f['variable']
    for i in range(0, 3712, 58):
        for j in range(0, 3712, 58):
            A = np.copy(d[:, i:i+58, j:j+58])
    print(time.time()-t1)
Results (write/read):
SSD: 10s/16s
HDD: 10s/16s
NAS: 13s/20s
The read/write speed can be improved further by minimizing the API calls (reading and writing larger blocks of chunks), as sketched below.
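For example, a minimal sketch of what that means for the read loop of Example_2 above (it reuses d and np from there; reading 4x4 chunk columns per call is an arbitrary block size I picked):
# Read 232x232 pixel blocks (4x4 chunks of 58x58) per call instead of one 58x58 column
block = 4 * 58
for i in range(0, 3712, block):
    for j in range(0, 3712, block):
        A = np.copy(d[:, i:i+block, j:j+block])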
I also want to mention the compression method here. Blosc can achieve up to 1 GB/s throughput (it is CPU-bound); gzip is slower, but provides better compression ratios.
d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=3)
20s/30s file size: 101 MB
d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=6)
50s/58s file size: 87 MB
d = f.create_dataset('variable', shape, maxshape=(None,3712,3712),dtype='f',chunks=(96,58,58),compression='gzip', compression_opts=9)
50s/60s file size: 64 MB
And now a benchmark for a whole month (30 days). The writing is a bit optimized: whole days are written as (96,3712,3712) blocks.
def modified_chunk_size():
    File_Name_HDF5 = 'some_Path'
    # One day of identical frames, written as a single (96,3712,3712) block
    Array_R = np.random.rand(1, 3712, 3712)
    Array = np.zeros((96, 3712, 3712), dtype=np.float32)
    for j in range(0, 96):
        Array[j, :, :] = Array_R
    f = h5.File(File_Name_HDF5, 'a', libver='latest')
    nodays = 30
    shape = 96, 3712, 3712
    d = f.create_dataset('variable', shape, maxshape=(None, 3712, 3712), dtype='f',
                         chunks=(96, 58, 58),
                         compression=32001, compression_opts=(0, 0, 0, 0, 9, 1, 1),
                         shuffle=False)
    f.swmr_mode = True
    # Writing: append one day (96 time steps) at a time, growing the dataset as needed
    t1 = time.time()
    for i in range(0, 96*nodays, 96):
        if i + 96 > d.shape[0]:
            d.resize((i+96, shape[1], shape[2]))
        d[i:i+96, :, :] = Array
    f.close()
    print(time.time()-t1)
    # Reading: one chunk column at a time
    t1 = time.time()
    f = h5.File(File_Name_HDF5, 'a', libver='latest')
    f.swmr_mode = True
    d = f['variable']
    for i in range(0, 3712, 58):
        for j in range(0, 3712, 58):
            A = np.copy(d[:, i:i+58, j:j+58])
    print(time.time()-t1)
133s/301s with blosc
432s/684s with gzip compression_opts=3
I had the same problems when accessing data on a NAS. I hope this helps...
Source: https://stackoverflow.com/questions/44881895/h5py-not-sticking-to-chunking-specification