Experience with using h5py to do analytical work on big data in Python?


执念已碎 2021-01-29 23:23

I do a lot of statistical work and use Python as my main language. Some of the data sets I work with, though, can take 20 GB of memory, which makes operating on them using in-memor

2 answers

说谎 (original poster)

2021-01-30 00:16
We use Python in conjunction with h5py, numpy/scipy and boost::python to do data analysis. Our typical datasets have sizes of up to a few hundred GBs.

HDF5 advantages:
- data can be inspected conveniently using the HDFView application, h5py/ipython and the h5* command-line tools
- APIs are available for different platforms and languages
- data can be structured using groups
- data can be annotated using attributes
- worry-free built-in data compression
- I/O on single datasets is fast
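
To make the points about groups, attributes, compression and single-dataset I/O concrete, here is a minimal sketch using h5py. The group name, dataset name, attribute and chunk shape are made up for illustration, and the file lives in memory (core driver) only so the example leaves nothing on disk; a real workflow would open an on-disk path.

```python
import h5py
import numpy as np

def build_demo_file():
    # In-memory HDF5 file: the core driver with backing_store=False writes
    # nothing to disk. This is an assumption made to keep the sketch tidy.
    f = h5py.File("demo.h5", "w", driver="core", backing_store=False)

    # Groups behave like directories; intermediate groups are created as needed.
    run = f.create_group("experiment/run_001")  # hypothetical group name

    data = np.arange(1_000_000, dtype="f8").reshape(1000, 1000)

    # Chunked, gzip-compressed dataset; compression is transparent on read.
    dset = run.create_dataset(
        "signal", data=data, chunks=(100, 1000), compression="gzip"
    )

    # Attributes carry small pieces of metadata next to the data.
    dset.attrs["units"] = "volts"  # hypothetical attribute
    return f

f = build_demo_file()
dset = f["experiment/run_001/signal"]

# Slicing reads only the requested rows, not the whole dataset.
block = dset[200:300, :]
block_mean = float(block.mean())
units = dset.attrs["units"]
f.close()
```

Because the slice is served from the file chunk by chunk, the same pattern should carry over to datasets that do not fit in memory, as long as each requested block does.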
HDF5 pitfalls:
- Performance breaks down if an h5 file contains too many datasets/groups (> 1000), because traversing them is very slow. On the other hand, I/O is fast for a few big datasets.
- Advanced (SQL-like) data queries are clumsy to implement and slow (consider SQLite in that case)
- HDF5 is not thread-safe in all cases: one has to ensure that the library was compiled with the correct options
- Changing h5 datasets (resize, delete, etc.) blows up the file size in the best case and is impossible in the worst case (the whole h5 file has to be copied to flatten it again)
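
The last pitfall can be softened somewhat by planning for it. The sketch below, assuming h5py and made-up dataset names, declares a dataset as growable up front with `maxshape`, and then "flattens" a file after a deletion by copying the surviving objects into a fresh file. Both files are kept in memory here purely for illustration.

```python
import h5py
import numpy as np

# Source file, in memory for the sake of the example.
src = h5py.File("src.h5", "w", driver="core", backing_store=False)

# maxshape=(None,) makes the first axis growable; without it the dataset
# has a fixed shape and cannot be resized later.
log = src.create_dataset(
    "log", shape=(0,), maxshape=(None,), dtype="i8", chunks=(1024,)
)

def append(dset, values):
    """Grow a 1-D resizable dataset and write `values` at the end."""
    values = np.asarray(values)
    start = dset.shape[0]
    dset.resize((start + values.shape[0],))
    dset[start:] = values

append(log, [1, 2, 3])
append(log, [4, 5])

# Deleting only unlinks the object; the space it used is not necessarily
# handed back, which is how files grow over time.
src.create_dataset("scratch", data=np.zeros(10_000))  # hypothetical dataset
del src["scratch"]

# "Flatten" by copying what is still linked into a fresh file.
dst = h5py.File("compact.h5", "w", driver="core", backing_store=False)
for name in src:
    src.copy(name, dst)

copied = dst["log"][...].tolist()
remaining = sorted(dst.keys())
src.close()
dst.close()
```

For on-disk files, the `h5repack` command-line tool that ships with HDF5 does the same kind of rewrite without any custom code.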