What to put in a binary data file's header

隐瞒了意图╮ 2021-02-06 01:52

I have a simulation that reads large binary data files that we create (10s to 100s of GB). We use binary for speed reasons. These files are system dependent, converted from te

12 Answers
  • 2021-02-06 01:55

    If they're that large, I'd reserve a healthy chunk (64K?) of space at the beginning of the file and put the metadata there in XML format, followed by an end-of-file character (Ctrl-Z for DOS/Windows, Ctrl-D for Unix?). That way you can examine and parse the metadata easily with the wide range of XML toolsets out there.
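
    A minimal sketch of the reserved-header idea, assuming a 64 KiB block, a Ctrl-Z (0x1A) terminator, and zero padding up to the block size. The function names and XML element names are hypothetical, and an in-memory buffer stands in for the real data file.

```python
import io
import xml.etree.ElementTree as ET

HEADER_SIZE = 64 * 1024   # reserved space; 64K is the figure floated above
EOF_MARK = b"\x1a"        # Ctrl-Z, the DOS/Windows end-of-file character


def write_header(f, fields):
    """Write `fields` as XML into a fixed-size, zero-padded header block."""
    root = ET.Element("metadata")              # hypothetical element name
    for name, value in fields.items():
        ET.SubElement(root, name).text = str(value)
    xml_bytes = ET.tostring(root, encoding="utf-8")
    if len(xml_bytes) + len(EOF_MARK) > HEADER_SIZE:
        raise ValueError("metadata does not fit in the reserved header")
    f.write((xml_bytes + EOF_MARK).ljust(HEADER_SIZE, b"\x00"))


def read_header(f):
    """Read the header block and return the metadata as a dict of strings."""
    block = f.read(HEADER_SIZE)
    if len(block) != HEADER_SIZE or EOF_MARK not in block:
        raise ValueError("missing or truncated header block")
    xml_bytes = block.split(EOF_MARK, 1)[0]    # drop the marker and padding
    root = ET.fromstring(xml_bytes)
    return {child.tag: child.text for child in root}


# Round trip through an in-memory buffer standing in for the data file.
buf = io.BytesIO()
write_header(buf, {"created": "2021-02-06T01:55:00Z", "host": "node01"})
buf.write(b"\x01\x02\x03\x04")                 # payload starts at HEADER_SIZE
buf.seek(0)
meta = read_header(buf)
payload = buf.read()
```

    Because the block has a fixed size, the binary payload always starts at the same offset, and the metadata can be rewritten in place without touching the data that follows.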

    Otherwise I go with what other people have already said: timestamp for file creation, identifier for which machine it's created on, basically anything else that you can think of for diagnostic purposes. And ideally you would include the definition of the structure format itself. If you are changing the structure often, it's a big pain to maintain the proper version of code around to read various formats of old datafiles.

    One big advantage of HDF5, as @highpercomp has mentioned, is that you just don't need to worry about changes in the structure format, as long as you have some convention for what the names and datatypes are. The structure names and datatypes are all stored in the file itself, so you can blow your C code to smithereens and it doesn't matter: you can still retrieve data from an HDF5 file. It lets you worry less about the format of the data and more about its structure, i.e. I don't care about the sequence of bytes, that's HDF5's problem, but I do care about field names and the like.

    Another reason I like HDF5 is you can choose to use compression, which takes a very small amount of time and can give you huge wins in storage space if the data is slowly-changing or mostly the same except for a few errant blips of interestingness.

  • 2021-02-06 01:56

    In addition to whatever information you need for schema versioning, add details that may be of value if you are troubleshooting an issue. For example:

    • timestamps of when the file was created and updated (if applicable).
    • the version string from the build (ideally you have a version string that is auto-incremented on every 'official' build ... this is different to the file schema version).
    • the name of the system creating the file, and maybe other statistics that are relevant to your app

    We find this is very useful (a) in getting information we would otherwise have to ask the customer to provide and (b) getting correct information -- it is amazing how many customers report they are running a different version of the software to what the data claims!
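
    The troubleshooting fields listed above could be packed into a fixed-size block along these lines. This is only a sketch: the layout (two 64-bit timestamps, a 32-byte build string, a 64-byte host name) and the function names are assumptions, not a recommended standard.

```python
import socket
import struct
import time

# Hypothetical layout: little-endian, created and updated timestamps as
# 64-bit seconds since the epoch, then NUL-padded build and host strings.
DIAG = struct.Struct("<QQ32s64s")


def pack_diagnostics(build, created=None, updated=None, host=None):
    """Pack the diagnostic fields; missing values are taken from this machine."""
    created = int(time.time()) if created is None else created
    updated = created if updated is None else updated
    host = socket.gethostname() if host is None else host
    # Strings longer than the field are silently truncated by struct.
    return DIAG.pack(created, updated, build.encode("utf-8"), host.encode("utf-8"))


def unpack_diagnostics(raw):
    """Inverse of pack_diagnostics; returns a plain dict."""
    created, updated, build, host = DIAG.unpack(raw)
    return {
        "created": created,
        "updated": updated,
        "build": build.rstrip(b"\x00").decode("utf-8"),
        "host": host.rstrip(b"\x00").decode("utf-8"),
    }


raw = pack_diagnostics("1.4.2+build.731", created=1612576500,
                       updated=1612576560, host="node01")
info = unpack_diagnostics(raw)
```

    Keeping the build string separate from the schema version means a support engineer can read both straight from the file instead of asking the customer.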

  • 2021-02-06 01:59

    If you are putting a version number in the header you can change that version anytime you need to change the POD struct or add new fields to the header.

    So don't add stuff to the header now because it might be interesting. You are just creating code that you have to maintain but that has little real value.
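
    To illustrate why a version number is enough to keep the header small, here is a sketch of a reader that picks the record layout from the version field. The two layouts and all names are made up for the example.

```python
import struct

# Hypothetical record layouts for two versions of the format.
RECORD_V1 = struct.Struct("<dd")     # x, y
RECORD_V2 = struct.Struct("<ddd")    # x, y, z (field added in version 2)
RECORD_BY_VERSION = {1: RECORD_V1, 2: RECORD_V2}

VERSION = struct.Struct("<I")        # 32-bit format version at offset 0


def read_records(data):
    """Return (version, records), choosing the layout from the version field."""
    (version,) = VERSION.unpack_from(data, 0)
    try:
        record = RECORD_BY_VERSION[version]
    except KeyError:
        raise ValueError(f"unsupported format version {version}") from None
    body = data[VERSION.size:]
    return version, list(record.iter_unpack(body))


old_file = VERSION.pack(1) + RECORD_V1.pack(1.0, 2.0)
new_file = VERSION.pack(2) + RECORD_V2.pack(1.0, 2.0, 3.0)
old_result = read_records(old_file)
new_result = read_records(new_file)
```

    When a new field turns out to be needed, it gets a new entry in the table and a bumped version, and old files remain readable.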

  • 2021-02-06 02:00

    @rstevens said 'an identifier for the type of file'... sound advice. Conventionally, that's called a magic number and, in a file, it isn't a term of abuse (unlike in code, where it is). Basically, it is some number, typically at least 4 bytes (I usually ensure that at least one of those bytes is not ASCII), that you can use to validate that the file is of the type you expect, with a low probability of confusion with other file types. You can also write a rule in /etc/magic (or the local equivalent) to report that files containing your magic number are your special file type.
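
    A small sketch of such a check, assuming a made-up 4-byte magic number whose first byte is outside the ASCII range (PNG files use a leading 0x89 byte for the same reason). The value and the function name are illustrative only.

```python
import io

# Hypothetical magic number: one non-ASCII byte followed by three letters.
MAGIC = b"\x89SIM"


def has_magic(f):
    """Return True if the stream starts with MAGIC (consumes those bytes)."""
    return f.read(len(MAGIC)) == MAGIC


data_file = io.BytesIO(MAGIC + b"...binary payload...")
text_file = io.BytesIO(b"x,y,z\n1,2,3\n")
empty_file = io.BytesIO(b"")

is_data = has_magic(data_file)
is_text = has_magic(text_file)
is_empty = has_magic(empty_file)
```

    The non-ASCII byte makes it very unlikely that a plain text file passes the check by accident.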

    You should include a file format version number. However, I would recommend not using the SVN revision number of the code, because your code may change when the file format does not.

  • 2021-02-06 02:05

    An identifier for the type of the file would be useful if you will have other structures written to binary files later on. Maybe this could be a short string, so that you can see what the file contains just by looking into it with a hex editor.

  • 2021-02-06 02:07

    For large binaries I'd look seriously at HDF5 (Google for it). Even if it's not something you want to adopt it might point you in some useful directions in designing your own formats.
