I do a lot of statistical work and use Python as my main language. Some of the data sets I work with, though, can take 20 GB of memory, which makes operating on them with in-memory tools nearly impossible.
We use Python in conjunction with h5py, numpy/scipy and boost::python to do data analysis. Our typical datasets are up to a few hundred GB in size; a sketch of how we stream over them follows the lists below.
HDF5 advantages:

- APIs and tooling exist for many platforms and languages, so the same files serve Python, C++ and the `h5*` command-line utilities.
- Data can be organized hierarchically with groups and annotated with attributes.
- Chunking and compression are built in and transparent to the reader.
- I/O on large individual datasets is fast, and slicing reads only the parts you ask for.
HDF5 pitfalls:

- Performance degrades when a file contains very many datasets or groups, because traversing them is slow; HDF5 shines with a few big datasets, not thousands of tiny ones.
- Advanced, SQL-like queries are clumsy to implement and slow (consider SQLite for that access pattern).
- Thread safety depends on how the library was compiled, so check your build before sharing handles across threads.
- Resizing or deleting datasets does not reclaim space in the file; you have to rewrite it (e.g. with h5repack) to shrink it again.
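To make that workflow concrete, here is a minimal sketch of the kind of chunked, out-of-core pass this stack enables. The file name `measurements.h5`, the dataset name `samples` and the sizes are made up for illustration; the point is that only one chunk-sized slice is ever in memory.

```python
import numpy as np
import h5py

CHUNK = 1_000_000  # elements per slice we read/write at a time

# Write a chunked, compressed dataset one slice at a time; HDF5 only
# materializes the chunks we touch, so this scales past available RAM.
with h5py.File("measurements.h5", "w") as f:
    dset = f.create_dataset(
        "samples",
        shape=(10 * CHUNK,),
        dtype="f8",
        chunks=(CHUNK,),        # on-disk read/write granularity
        compression="gzip",     # transparent, built-in compression
    )
    for start in range(0, dset.shape[0], CHUNK):
        dset[start:start + CHUNK] = np.random.standard_normal(CHUNK)

# Stream back over the dataset to compute a mean without loading it whole.
with h5py.File("measurements.h5", "r") as f:
    dset = f["samples"]
    total, count = 0.0, 0
    for start in range(0, dset.shape[0], CHUNK):
        block = dset[start:start + CHUNK]  # this slice is a numpy array
        total += block.sum()
        count += block.size
    print("mean =", total / count)
```

The same pattern (slice, reduce, discard) covers most single-pass statistics; anything needing repeated random access benefits from matching the slice size to the dataset's chunk shape.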
This is a long comment, not an answer to your actual question about h5py. I don't use Python for stats and tend to deal with relatively small datasets, but it might be worth a moment to check out the CRAN Task View for high-performance computing in R, especially the "Large memory and out-of-memory data" section.
Three reasons:

- The packages listed there (bigmemory, ff and friends) document out-of-core strategies that are worth studying regardless of language.
- The package names and descriptions give you good search terms for hunting down Python equivalents.
- Failing that, you can call R from Python through rpy2 and use those packages directly (a sketch follows).
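If you do want to poke at those R packages without leaving Python, rpy2 is one such bridge; here is a minimal sketch (the R expression is just a stand-in, and the commented `library(ff)` line assumes that package is installed):

```python
import rpy2.robjects as robjects

# robjects.r(...) parses and evaluates a string of R code and returns the
# resulting R object, which can be indexed from Python like a sequence.
r_mean = robjects.r("mean(rnorm(1000))")
print("mean from R:", r_mean[0])

# Any installed R package loads the same way, e.g. the out-of-core ones
# from the task view:
# robjects.r("library(ff)")
```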
Again, I emphasize that this is all way out of my league, and it's certainly possible that you might already know all of this. But perhaps this will prove useful to you or someone working on the same problems.