What is better between multiple small HDF5 files or one huge file?


Question


I'm working with huge satellite data that I'm splitting into small tiles to feed a deep learning model. I'm using PyTorch, which means the data loader can work with multiple workers. [Setup: Python, Ubuntu 18.04]

I can't find any answer as to which is better in terms of data access and storage:

  1. storing all the data in one huge HDF5 file (over 20 GB), or
  2. splitting it into many (over 16,000) small HDF5 files (approx. 1.4 MB each).

Is there any problem with multiple workers accessing a single file? And, in the other case, is there an impact from having that many files?


Answer 1:


I would go for multiple files if I were you (but read till the end).

Intuitively, you could load at least some of the files into memory, speeding up the process a little (it is unlikely you would be able to do so with 20 GB; if you can, then you definitely should, as RAM access is much faster).

You could cache those examples (inside a custom torch.utils.data.Dataset instance) during the first pass and retrieve cached examples (say, from a list or another more memory-efficient data structure, preferably with better cache locality) instead of reading from disk (a similar approach to Tensorflow's tf.data.Dataset object and its cache method).
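Here is a minimal sketch of that caching idea. The layout is an assumption, not from the question: one tile per small HDF5 file, stored under the dataset name "tile". The first pass reads from disk; later passes are served from RAM:

```python
import h5py
import torch
from torch.utils.data import Dataset

class CachedTileDataset(Dataset):
    """Caches tiles in RAM after the first read.

    Hypothetical layout: one tile per small HDF5 file, stored
    under the dataset name "tile".
    """

    def __init__(self, file_paths):
        self.file_paths = list(file_paths)
        self.cache = [None] * len(self.file_paths)  # filled lazily

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        if self.cache[idx] is None:
            # First pass: read the tile from disk once.
            with h5py.File(self.file_paths[idx], "r") as f:
                self.cache[idx] = torch.from_numpy(f["tile"][()])
        # Subsequent passes: served straight from RAM.
        return self.cache[idx]
```

One caveat: with num_workers > 0, each DataLoader worker is a separate process holding its own copy of the cache, so memory use multiplies and a tile cached in one worker does not help the others.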

On the other hand, this approach is more cumbersome and harder to implement correctly; that said, if you are only reading the file with multiple threads, you should be fine, and there shouldn't be any locks on that operation.
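If you do go with one big file instead, the usual pattern is to open the HDF5 handle lazily in each worker, since an h5py handle opened in the parent process is not reliably usable after the DataLoader forks. A sketch, assuming a hypothetical file tiles.h5 with a single dataset "tiles" of shape (N, H, W, C):

```python
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class SingleFileTileDataset(Dataset):
    """Reads tiles from one large HDF5 file.

    Hypothetical layout: a file "tiles.h5" with a single dataset
    "tiles" of shape (N, H, W, C). The handle is opened lazily in
    each worker process, not in the parent.
    """

    def __init__(self, path):
        self.path = path
        self._file = None  # opened on first access, once per worker
        with h5py.File(path, "r") as f:  # short-lived handle, only for the length
            self._len = f["tiles"].shape[0]

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        # Slicing an h5py dataset returns a NumPy array.
        return torch.from_numpy(self._file["tiles"][idx])

loader = DataLoader(SingleFileTileDataset("tiles.h5"),
                    batch_size=32, num_workers=4)
```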

Remember to measure your approach with PyTorch's profiler (torch.utils.bottleneck) to pinpoint the exact problems and verify your solutions.
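For reference, torch.utils.bottleneck is run as a command-line tool rather than imported; assuming a hypothetical entry script named train.py, the invocation is `python -m torch.utils.bottleneck train.py`.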



Source: https://stackoverflow.com/questions/56884090/what-is-the-best-beetween-multiple-small-h5-files-or-one-huge
