Is there a way to get a NumPy-style view of a slice of an array stored in an HDF5 file?

Question

I have to work on large 3D cubes of data. I want to store them in HDF5 files (using h5py or maybe PyTables). I often want to perform analysis on just a section of these cubes, and that section is too large to hold in memory. I would like a NumPy-style view of my slice of interest, without copying the data into memory (similar to what you can do with a NumPy memmap). Is this possible? As far as I know, slicing with h5py gives you a NumPy array in memory.

It has been asked why I would want to do this, since the data has to enter memory at some point anyway. My code, out of necessity, already runs piecemeal over data from these cubes, pulling small bits into memory at a time. These functions are simplest if they just iterate over the entirety of the datasets passed to them. If I could have a view of the data on disk, I could pass this view to these functions unchanged. If I cannot have a view, I need to write all my functions to iterate only over the slice of interest. This will add complexity to the code and make human error during analysis more likely.

Is there any way to get a view of the data on disk, without copying it into memory?

Answer 1:

One possibility is to create a generator that yields the elements of the slice one by one. Once you have such a generator, you can pass it to your existing code and iterate through it as normal. For example, you can use a for loop on a generator, just as you might use it on a slice. Generators do not store all of their values at once; they 'generate' them as needed.

You might be able to create a slice of just the locations in the cube that you want, but not the data itself, or you could generate the next location of your slice programmatically if there are too many locations to store in memory as well. A generator could use those locations to yield the data they contain one by one.

Assuming your slices are the (possibly higher-dimensional) equivalent of cuboids, you might generate coordinates using nested for-range() loops, or by applying product() from the itertools module to range objects.
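As a minimal sketch of the idea above, the generator below walks a cuboid region with `itertools.product` and reads one element at a time. The name `iter_cuboid` and its `starts`/`stops` parameters are made up for illustration; `dataset` is assumed to be anything that accepts a tuple of integers as an index, which is the case for both h5py datasets and NumPy arrays.

```python
from itertools import product


def iter_cuboid(dataset, starts, stops):
    """Yield (index, value) pairs for every element of a cuboid region.

    `starts` and `stops` hold one inclusive start and one exclusive stop
    per axis. Only a single element is read from `dataset` at a time.
    """
    # One range object per axis; these store only their bounds.
    ranges = [range(lo, hi) for lo, hi in zip(starts, stops)]
    # product() generates the coordinates lazily, last axis fastest.
    for index in product(*ranges):
        yield index, dataset[index]
```

Reading an HDF5 dataset element by element is likely to be slow, so in practice it may be better to yield whole rows or slabs instead of single values.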

Answer 2:

Copying that section of the dataset into memory is unavoidable. The reason is simply that you are requesting the entire section, not just a small part of it. Therefore, it must be copied completely.

So, as h5py already allows you to use HDF5 datasets in the same way as NumPy arrays, you will have to change your code to only request the values in the dataset that you currently need.
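One way to request only the values currently needed, while still passing a single object to existing functions, is a small wrapper that remembers the region of interest and reads it slab by slab. This is only a sketch under stated assumptions: `LazyRegion` is a hypothetical helper, not part of h5py, and it expects one `slice(start, stop)` per axis with explicit bounds and no step.

```python
class LazyRegion:
    """Hypothetical helper: remembers a region of a dataset without reading it."""

    def __init__(self, dataset, region):
        self.dataset = dataset
        # One slice(start, stop) per axis; bounds must be explicit integers.
        self.region = tuple(region)

    def __len__(self):
        first = self.region[0]
        return first.stop - first.start

    def __iter__(self):
        # Read one slab along the first axis per iteration, so only that
        # slab is held in memory at any time.
        first, rest = self.region[0], self.region[1:]
        for i in range(first.start, first.stop):
            yield self.dataset[(i,) + rest]
```

With h5py, this might be used as `LazyRegion(f["cube"], (slice(100, 200), slice(0, 50), slice(0, 50)))`, where `f` is an open `h5py.File` and `"cube"` is an assumed dataset name. Each iteration would then return one 2D NumPy array read from disk.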

Source: https://stackoverflow.com/questions/27803331/is-there-a-way-to-get-a-numpy-style-view-to-a-slice-of-an-array-stored-in-a-hdf5

Tags

python

hdf5

pytables

h5py