How is HDF5 different from a folder with files?

后端 未结 9 1831
予麋鹿
予麋鹿 2021-01-29 22:43

I\'m working on an open source project dealing with adding metadata to folders. The provided (Python) API lets you browse and access metadata like it was just another folder. Be

相关标签:
9条回答
  • 2021-01-29 22:50

    I'm currently evaluating HDF5 so had the same question.

    This article – Moving Away from HDF5 – asks pretty much the same question. The article raises some good points about the fact that there is only a single implementation of the HDF5 library which is developed in relatively opaque circumstances by modern open-source standards.

    As you can tell from the title, the authors decided to move away from HDF5, to a filesystem hierarchy of binary files containing arrays with metadata in JSON files. This was in spite having made a significant investment in HDF5, having had their fingers burnt by data corruption and performance issues.

    0 讨论(0)
  • 2021-01-29 22:55

    A game where you need to load a lot of resources into the memory would be a scenario in which an HDF5 may be better than a folder with files. Loading data from files has costs as seek time, the time required to open each file, and read data from the file into memory. These operations can be even slower when reading data from a DVD or Blu-ray. Opening a single file can reduce drastically those costs.

    0 讨论(0)
  • 2021-01-29 22:55

    One factor to consider is performance of disk access. Using hd5f, everything is stored in continuous area of disk, making reading data faster with fewer disk seek and rotation. On the other hand, using file system to organize data may involve reading from many small files, thus more disk access is required.

    0 讨论(0)
  • 2021-01-29 22:56

    HDF5 is ultimately, a format to store numbers, optimised for large datasets. The main strengths are the support for compression (that can make reading and writing data faster in many circumstances) and the fast in-kernel queries (retrieval of data fulfilling certain conditions, for example, all the values of pressure when the temperature was over 30 C).

    The fact that you can combine several datasets in the same file is just a convenience. For example, you could have several groups corresponding to different weather stations, and each group consisting on several tables of data. For each group you would have a set of attributes describing the details of the instruments, and each table the individual settings. You can have one h5 file for each block of data, with an attribute in the corresponding place and it would give you the same functionality. But now, what you can do with HDF5 is to repack the file for optimized querying, compress the whole thing slightly, and retrieve your information at a blazing speed. If you have several files, each one would be individually compressed, and the OS would decide the layout on disk, that may not be the optimal.

    One last thing HDF5 allows you is to load a file (or a piece) in memory exposing the same API as in disk. So, for example, you could use one or other backend depending on the size of the data and the available RAM. In your case, that would be equivalent as copying the relevant information to /dev/shm in Linux, and you would be responsible for commiting back to disk any modification.

    0 讨论(0)
  • 2021-01-29 23:08

    Yes, the main advantage is that HDF5 is portable. HDF5 files can be accessed by a host of other programming/interpreting languages, such as Python (which your API is built on), MATLAB, Fortran and C. As Simon suggested, HDF5 is used extensively in the scientific community to store large datasets. In my experience, I find the ability to retrieve only certain datasets (and regions) useful. In addition, the building the HDF5 library for parallel I/O is very advantageous for post-processing of raw data at a later time.

    Since the file is also self-describing, it is capable of storing not just raw data, but also description of that data, such as the array size, array name, units and a host of additional metadata.

    Hope this helps.

    0 讨论(0)
  • 2021-01-29 23:09

    I think the main advantage is portability.

    HDF5 stores information about your datasets like the size, type and endianness of integers and floating point numbers, which means you can move an hdf5 file around and read its content even if it was created on a machine with a different architecture.

    You can also attach arbitrary metadata to groups and datasets. Arguably you can also do that with files and folders if your filesystem support extended attributes.

    An hdf5 file is a single file which can sometimes be more convenient than having to zip/tar folders and files. There is also a major drawback to this: if you delete a dataset, you can't reclaim the space without creating a new file.

    Generally, HDF5 is well suited for storing large arrays of numbers, typically scientific datasets.

    0 讨论(0)
提交回复
热议问题