h5py: chunking on resizable dataset


Question


I have a series of raster datasets which I want to combine into a single HDF5 file.

Each raster file will be converted into an array with the dimensions 3600 x 7000. As I have a total of 659 files, the final array would have a shape of 3600 x 7000 x 659, too big for my (huge) amount of RAM.

I'm fairly new to Python and HDF5 itself, but basically my approach is to create a dataset with the required first two dimensions and then iteratively read the files into arrays and append them to the dataset along the third dimension.
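For context, here is a minimal sketch of that append loop. It assumes a hypothetical read_raster() helper that returns one 3600 x 7000 array per file, and a list raster_files holding the 659 paths (both names are placeholders, not real APIs):

import h5py

rows, cols = 3600, 7000

with h5py.File('stack.h5', 'w') as f:
    # start with zero timesteps; the third axis is unlimited
    dset = f.create_dataset('rasters', shape=(rows, cols, 0),
                            maxshape=(rows, cols, None),
                            dtype='float32', chunks=True)  # auto-chunking, revisited below
    for i, path in enumerate(raster_files):   # hypothetical list of the 659 files
        layer = read_raster(path)              # hypothetical loader -> (3600, 7000) array
        dset.resize(i + 1, axis=2)             # grow the dataset by one timestep
        dset[:, :, i] = layer                  # only one raster held in RAM at a time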

I'm planning to use chunking (which should decrease I/O time) in accordance with my planned use of the dataset. As it's a raster time series, I intend to divide the 3-d array into chunks along the first 2 dimensions, while always processing the dataset in its entirety along the third dimension, as sketched below.
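To make that access pattern concrete (tile offsets and dimensions here are hypothetical):

# intended access pattern: one spatial tile, the full time series
# (r0, c0, nrow, ncol are hypothetical tile offsets/dimensions)
series = dset[r0:r0 + nrow, c0:c0 + ncol, :]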

I know that I can define the maxshape of the new dataset with maxshape = (rows,cols,None) in order to keep the dataset resizable along the 3rd dimension when new raster files (new timesteps) come in.

So my question now is: how do I specify the chunking accordingly? chunks=True gives chunks which are too small. Therefore I'm setting chunks=(nrow,ncol,659), with nrow and ncol being the dimensions of the block.

Is there any way to account in the chunks for the resizing along the third dimension (like chunks = (nrow,ncol,None))?
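For reference, this is what the fixed-chunk variant looks like in full, reusing the open file f from the sketch above (the block size is a made-up example; as far as I know, h5py requires every entry in the chunks tuple to be a concrete integer, which is what prompts the question):

# hypothetical block size for the first two axes
nrow, ncol = 100, 100

dset = f.create_dataset('rasters2', shape=(3600, 7000, 659),
                        maxshape=(3600, 7000, None),
                        chunks=(nrow, ncol, 659),  # every entry must be an int; None is rejected
                        dtype='float32')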

If not (and the third dimension ends up larger than the specified chunk, but less than twice its size), is the best (fastest) way to read the data chunk by chunk, à la:

# first (full) chunk along the time axis
array1 = data[0:nrow, 0:ncol, 0:659]
# remaining (partial) chunk
array2 = data[0:nrow, 0:ncol, 659:]

Many thanks!

PS: If someone has a suggestion for how to do this more efficiently or elegantly, I'd also greatly appreciate any tips.

Val

Source: https://stackoverflow.com/questions/43227009/h5py-chunking-on-resizable-dataset
