Question
I have a series of raster datasets which I want to combine into a single HDF5 file. Each raster file will be converted into an array with the dimensions 3600 x 7000. As I have a total of 659 files, the final array would have a shape of 3600 x 7000 x 659, which is too big for my (huge) amount of RAM.
I'm fairly new to Python and to HDF5 itself, but basically my approach is to create a dataset with the required 2-d dimensions and then iteratively read the files into arrays and append them to the dataset.
I'm planning to use chunking (which should decrease I/O time) in accordance with my planned use of the dataset. As it's a raster time series, I intend to divide the 3-d array into chunks along the first two dimensions, while always processing the dataset in its entirety along the third dimension.
I know that I can define the maxshape of the new dataset with maxshape = (rows, cols, None) in order to keep the dataset resizable along the third dimension when new raster files (new timesteps) come in.
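To make that concrete, the creation-and-fill step I have in mind looks roughly like this (a sketch only: the input paths, the block size nrow/ncol, the chunk shape, and the read_raster helper are placeholders, not my actual code):

import glob
import h5py

rows, cols = 3600, 7000                        # dimensions of a single raster
nrow, ncol = 100, 100                          # placeholder spatial block size used for chunking
files = sorted(glob.glob("rasters/*.tif"))     # placeholder input paths

with h5py.File("timeseries.h5", "w") as f:
    dset = f.create_dataset(
        "data",
        shape=(rows, cols, len(files)),
        maxshape=(rows, cols, None),           # unlimited along the third (time) dimension
        chunks=(nrow, ncol, len(files)),       # example chunk: one spatial block x full time depth
        dtype="float32",
    )
    for i, path in enumerate(files):
        dset[:, :, i] = read_raster(path)      # read_raster: placeholder returning a (rows, cols) array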
So my question now is: how do I specify the chunking accordingly? chunks=True gives chunks which are too small.
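This is roughly how I checked what chunk shape chunks=True picks (a sketch against a throwaway file; chunked datasets are allocated lazily, so nothing large is actually written):

import h5py

with h5py.File("chunk_test.h5", "w") as f:
    dset = f.create_dataset(
        "data",
        shape=(3600, 7000, 659),
        maxshape=(3600, 7000, None),
        chunks=True,                           # let h5py guess a chunk shape
        dtype="float32",
    )
    print(dset.chunks)                         # the auto-guessed chunk shape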
Therefore I'm setting chunks=(nrow, ncol, 659), with nrow and ncol being the dimensions of the block. Is there any way to account in the chunks for the resizing along the third dimension (something like chunks=(nrow, ncol, None))?
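For context, appending new timesteps later would then look roughly like this (a sketch; new_files and read_raster are placeholders):

import h5py

with h5py.File("timeseries.h5", "a") as f:
    dset = f["data"]
    for path in new_files:                      # new_files: placeholder list of new raster paths
        dset.resize(dset.shape[2] + 1, axis=2)  # grow the dataset along the time axis
        dset[:, :, -1] = read_raster(path)      # write the new timestep into the last slot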
If that's not possible (and the third dimension ends up larger than the specified chunk, but less than twice of it), is the best (fastest) way to read the data chunk by chunk something like:
array1 = data[0:nrow,0:ncol,0:659]
array2 = data[0:nrow,0:ncol,659:]
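To make the intended access pattern concrete, the per-block processing I'm aiming for would look roughly like this (a sketch; process_block and the block size are placeholders):

import h5py

with h5py.File("timeseries.h5", "r") as f:
    dset = f["data"]
    rows, cols, _ = dset.shape
    nrow, ncol = 100, 100                             # placeholder spatial block size, matching the chunks
    for r in range(0, rows, nrow):
        for c in range(0, cols, ncol):
            block = dset[r:r + nrow, c:c + ncol, :]   # full time series for one spatial block
            process_block(block)                      # process_block: placeholder for the analysis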
Many thanks!
PS: If someone also has a suggestion on how to do this more efficiently or elegantly, I'd greatly appreciate any tips.
Val
Source: https://stackoverflow.com/questions/43227009/h5py-chunking-on-resizable-dataset