What to put in a binary data file's header

隐瞒了意图╮ 2021-02-06 01:52

I have a simulation that reads large binary data files that we create (10s to 100s of GB). We use binary for speed reasons. These files are system dependent, converted from te

12 answers
  • 2021-02-06 02:07

    You might consider putting a file offset in a fixed position in the header, which tells you where the actual data begins in the file. This would let you change the size of the header when needed.
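A minimal sketch of the data-offset idea, assuming a hypothetical layout in which bytes 0-7 hold a magic string and bytes 8-15 hold the offset of the first data byte, stored little-endian. The names `HDR_MAGIC`, `header_write` and `header_read` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: bytes 0-7 magic, bytes 8-15 data offset (little-endian). */
#define HDR_MAGIC      "MYFILE01"
#define HDR_FIXED_SIZE 16

/* Fill the fixed part of the header; buf must hold HDR_FIXED_SIZE bytes. */
void header_write(unsigned char *buf, uint64_t data_offset)
{
    assert(buf != NULL);
    memcpy(buf, HDR_MAGIC, 8);
    for (int i = 0; i < 8; i++)
        buf[8 + i] = (unsigned char)(data_offset >> (8 * i));
}

/* Return 0 and store the data offset, or -1 if the magic does not match. */
int header_read(const unsigned char *buf, uint64_t *data_offset)
{
    uint64_t off = 0;
    assert(buf != NULL && data_offset != NULL);
    if (memcmp(buf, HDR_MAGIC, 8) != 0)
        return -1;
    for (int i = 0; i < 8; i++)
        off |= (uint64_t)buf[8 + i] << (8 * i);
    *data_offset = off;
    return 0;
}
```

A reader only ever looks at the first 16 bytes and then seeks to the stored offset, so everything between byte 16 and that offset can grow or change between format versions.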

    In a couple of cases, I put the value 0x12345678 into the header so I could detect whether the file's byte order matched the endianness of the machine processing it.
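The byte-order check could look roughly like this. It assumes the writer stored 0x12345678 as a raw 32-bit value in its own native order; the enum and function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BYTE_ORDER_MARK 0x12345678u

enum byte_order { ORDER_NATIVE, ORDER_SWAPPED, ORDER_UNKNOWN };

/* Reverse the byte order of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Classify the 4 marker bytes read from the file header. */
enum byte_order classify_marker(const unsigned char *raw)
{
    uint32_t v;
    assert(raw != NULL);
    memcpy(&v, raw, sizeof v);      /* interpret the bytes in this machine's order */
    if (v == BYTE_ORDER_MARK)
        return ORDER_NATIVE;        /* file can be read as-is */
    if (swap32(v) == BYTE_ORDER_MARK)
        return ORDER_SWAPPED;       /* every multi-byte field needs swapping */
    return ORDER_UNKNOWN;           /* not this format, or corrupted */
}
```

The value works as a marker because all four of its bytes differ, so a swapped file can never be mistaken for a native one.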

  • 2021-02-06 02:11

    For large files, you might want to add data definitions, so your file format becomes self-describing.
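One possible shape for such data definitions is a small table of field descriptors stored in the file, from which a reader derives the record layout. The descriptor struct, the type codes and the field names below are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type codes for a field descriptor. */
enum field_type { FT_INT32 = 1, FT_FLOAT64 = 2 };

/* One entry of the data definition table stored in the file. */
struct field_desc {
    char     name[16];   /* NUL-padded field name */
    uint32_t type;       /* one of enum field_type */
    uint32_t count;      /* number of elements (1 for a scalar) */
};

/* Size in bytes of one element of the given type, or 0 if unknown. */
size_t field_type_size(uint32_t type)
{
    switch (type) {
    case FT_INT32:   return 4;
    case FT_FLOAT64: return 8;
    default:         return 0;
    }
}

/* Size of one record; returns 0 if any field has an unknown type. */
size_t record_size(const struct field_desc *fields, size_t nfields)
{
    size_t total = 0;
    assert(fields != NULL || nfields == 0);
    for (size_t i = 0; i < nfields; i++) {
        size_t elem = field_type_size(fields[i].type);
        if (elem == 0)
            return 0;
        total += elem * fields[i].count;
    }
    return total;
}
```

A reader that meets an unknown type code can refuse the file cleanly instead of misreading every record after it.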

  • 2021-02-06 02:16

    In my experience, second-guessing the data you'll need is invariably wasted time. What's important is to structure your metadata in a way that is extensible. For XML files, that's straightforward, but binary files require a bit more thought.

    I tend to store metadata in a structure at the END of the file, not the beginning. This has two advantages:

    • Truncated/unterminated files are easily detected.
    • Metadata footers can often be appended to existing files without impacting their reading code.

    The simplest metadata footer I use looks something like this:

    #include <stdint.h>

    struct MetadataFooter
    {
      char creatorVersion[40];
      char creatorApplication[40];
      /* ... or whatever else you need */
    };

    struct FileFooter
    {
      int64_t metadataFooterSize;  /* = sizeof(struct MetadataFooter) */
      char magicString[10];        /* a unique identifier for the format: maybe "MYFILEFMT" */
    };
    

    After the raw data, the metadata footer and THEN the file footer are written.

    When reading the file, seek to the end - sizeof(FileFooter). Read the footer, and verify the magicString. Then, seek back according to metadataFooterSize and read the metadata. Depending on the footer size contained in the file, you can use default values for missing fields.
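The write and read steps could be sketched as follows. This assumes both footers are written as raw structs on the same platform that reads them, and that zeroed fields are acceptable as the "default values"; `append_footers` and `read_footers` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FOOTER_MAGIC "MYFILEFMT"

struct MetadataFooter {
    char creatorVersion[40];
    char creatorApplication[40];
};

struct FileFooter {
    int64_t metadataFooterSize;
    char    magicString[10];
};

/* Append the metadata footer and THEN the file footer at the current position. */
int append_footers(FILE *f, const struct MetadataFooter *meta)
{
    struct FileFooter ff;
    assert(f != NULL && meta != NULL);
    memset(&ff, 0, sizeof ff);                 /* also zeroes struct padding */
    ff.metadataFooterSize = (int64_t)sizeof *meta;
    memcpy(ff.magicString, FOOTER_MAGIC, sizeof FOOTER_MAGIC);
    if (fwrite(meta, sizeof *meta, 1, f) != 1) return -1;
    if (fwrite(&ff, sizeof ff, 1, f) != 1) return -1;
    return 0;
}

/* Read the metadata back; fields missing from an older file stay zeroed.
   Returns 0 on success, -1 if the file is truncated or not in this format. */
int read_footers(FILE *f, struct MetadataFooter *meta)
{
    struct FileFooter ff;
    size_t n;
    assert(f != NULL && meta != NULL);
    /* fseek takes a long; very large files may need fseeko or _fseeki64. */
    if (fseek(f, -(long)sizeof(ff), SEEK_END) != 0) return -1;
    if (fread(&ff, sizeof ff, 1, f) != 1) return -1;
    if (memcmp(ff.magicString, FOOTER_MAGIC, sizeof FOOTER_MAGIC) != 0)
        return -1;
    if (ff.metadataFooterSize < 0) return -1;
    n = (size_t)ff.metadataFooterSize;
    if (fseek(f, -(long)(sizeof(ff) + n), SEEK_END) != 0) return -1;
    memset(meta, 0, sizeof *meta);
    if (n > sizeof *meta) n = sizeof *meta;    /* newer file: ignore extra fields */
    if (n > 0 && fread(meta, n, 1, f) != 1) return -1;
    return 0;
}
```

A file that was cut off mid-write has no valid magic string in its last bytes, so `read_footers` rejects it before any data is interpreted.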

    As KeithB points out, you could even use this technique to store the metadata as an XML string, combining totally extensible metadata with the compactness and speed of binary data.

  • 2021-02-06 02:16

    My variation combines Roddy and Jason S's approaches.

    In summary - put formatted text metadata at the end of the file with a way to determine its length stored elsewhere.

    1) Put a length field at the beginning of your file so you know the length of the metadata at the end rather than assuming a fixed length. That way, to get the metadata you just read that fixed-length initial field and then read the metadata blob from the end of the file.

    2) Use XML or YAML or JSON for the metadata. This is especially useful/safe if the metadata is appended at the end because nobody reading the file is going to automatically think it's all XML just because it starts with XML.

    The only disadvantage of this approach is that when your metadata grows, you have to update both the head and the tail of the file, but it's likely other parts will have been updated anyway. If you are just updating trivia like a last-accessed date, the metadata length won't change, so it only needs an in-place update.
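A rough sketch of this head-plus-tail layout, assuming an 8-byte metadata length in native byte order at offset 0 and a text blob (JSON here) at the very end. The function names and the sample metadata are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Write: [uint64 metadata length][raw data][metadata text]. Returns 0 on success. */
int write_with_tail_metadata(FILE *f, const void *data, size_t data_len,
                             const char *meta_text)
{
    uint64_t meta_len;
    assert(f != NULL && meta_text != NULL);
    meta_len = (uint64_t)strlen(meta_text);
    if (fwrite(&meta_len, sizeof meta_len, 1, f) != 1) return -1;
    if (data_len > 0 && fwrite(data, data_len, 1, f) != 1) return -1;
    if (meta_len > 0 && fwrite(meta_text, (size_t)meta_len, 1, f) != 1) return -1;
    return 0;
}

/* Read the metadata text into out (NUL-terminated). Returns 0 on success. */
int read_tail_metadata(FILE *f, char *out, size_t out_size)
{
    uint64_t meta_len;
    assert(f != NULL && out != NULL);
    if (fseek(f, 0L, SEEK_SET) != 0) return -1;
    if (fread(&meta_len, sizeof meta_len, 1, f) != 1) return -1;
    if (meta_len >= out_size) return -1;           /* caller's buffer is too small */
    /* fseek takes a long; very large files may need fseeko or _fseeki64. */
    if (fseek(f, -(long)meta_len, SEEK_END) != 0) return -1;
    if (meta_len > 0 && fread(out, (size_t)meta_len, 1, f) != 1) return -1;
    out[meta_len] = '\0';
    return 0;
}
```

Reading the metadata costs two small reads and one seek regardless of how large the data section is.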

  • 2021-02-06 02:17

    For large binaries, in addition to the version number I tend to put in a record count and a CRC, because large binaries are much more prone to being truncated or corrupted over time or during transfer than smaller ones. I found recently, to my horror, that Windows does not handle this well at all: I used Explorer to copy about 2 TB across a couple of hundred files to an attached NAS device, and found that 2-3 files on each copy were damaged (not completely copied).
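As an illustration of the record count plus CRC check, here is a small sketch using a bitwise CRC-32 (the common reflected polynomial 0xEDB88320). The `integrity_info` struct and the fixed record size are assumptions; a real reader for files this large would feed the CRC in chunks, which `crc32_update` allows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Incremental CRC-32; start with crc = 0 and feed successive chunks. */
uint32_t crc32_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;
    assert(data != NULL || len == 0);
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical integrity fields kept in the header. */
struct integrity_info {
    uint64_t record_count;
    uint32_t crc32;
};

/* Returns 1 if the payload has the expected size and checksum, 0 otherwise. */
int payload_is_intact(const struct integrity_info *info, size_t record_size,
                      const void *payload, size_t payload_len)
{
    assert(info != NULL && payload != NULL);
    if (payload_len != info->record_count * record_size)
        return 0;                       /* truncated, or has trailing garbage */
    return crc32_update(0, payload, payload_len) == info->crc32;
}
```

The record count catches an incomplete copy cheaply, before the more expensive checksum pass is even started.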

  • 2021-02-06 02:20

    As my experience with telecom equipment configuration and firmware upgrades shows, you only really need a few predefined bytes at the beginning (this is important), starting with the version; this is the fixed part of the header. The rest of the header is optional: by indicating the proper version you can always show how to process it. The important thing is that you'd better place the 'variable' part of the header at the end of the file if you plan to operate on the header without modifying the file content itself. This also simplifies 'append' operations, which have to recalculate the variable part of the header.

    Nice-to-have features for the fixed-size header (at the beginning):

    • Common 'length' field (including header).
    • Something like CRC32 (including header).
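A sketch of a fixed header carrying both fields, shown on an in-memory file image for brevity. The 16-byte layout is an assumption, as is the convention of treating the CRC field as zero while hashing, which is one common way to let a checksum cover the header that contains it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: [0..3] version, [4..11] total length, [12..15] CRC-32,
   all little-endian. Length and CRC both cover the header itself. */
#define FIXED_HEADER_SIZE 16

/* Incremental bitwise CRC-32 (reflected polynomial 0xEDB88320). */
uint32_t crc32_update(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

void put_le(unsigned char *dst, uint64_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++)
        dst[i] = (unsigned char)(v >> (8 * i));
}

uint64_t get_le(const unsigned char *src, int nbytes)
{
    uint64_t v = 0;
    for (int i = 0; i < nbytes; i++)
        v |= (uint64_t)src[i] << (8 * i);
    return v;
}

/* Fill in the header of a complete file image of total_len bytes. */
void seal_file_image(unsigned char *file, uint64_t total_len, uint32_t version)
{
    assert(file != NULL && total_len >= FIXED_HEADER_SIZE);
    put_le(file, version, 4);
    put_le(file + 4, total_len, 8);
    put_le(file + 12, 0, 4);                    /* CRC field is zero while hashing */
    put_le(file + 12, crc32_update(0, file, (size_t)total_len), 4);
}

/* Returns 1 if the stored length and CRC match the image, 0 otherwise. */
int check_file_image(const unsigned char *file, size_t actual_len)
{
    unsigned char hdr[FIXED_HEADER_SIZE];
    uint32_t crc;
    assert(file != NULL);
    if (actual_len < FIXED_HEADER_SIZE) return 0;
    if (get_le(file + 4, 8) != actual_len) return 0;   /* truncated or padded */
    memcpy(hdr, file, sizeof hdr);
    memset(hdr + 12, 0, 4);                     /* hash the header with CRC zeroed */
    crc = crc32_update(0, hdr, sizeof hdr);
    crc = crc32_update(crc, file + FIXED_HEADER_SIZE,
                       actual_len - FIXED_HEADER_SIZE);
    return crc == (uint32_t)get_le(file + 12, 4);
}
```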

    OK, for the variable part, XML or some other extensible format is a good idea, but is it really needed? I have a lot of experience with ASN.1 encoding... in most cases its usage was overkill.

    Well, maybe you will get a better understanding if you look at something like the TPKT format, which is described in RFC 2126 (chapter 4.3).
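For reference, a TPKT header is only four bytes: a version byte (3), a reserved byte, and a 16-bit big-endian length that includes the header itself. A minimal sketch of building and parsing it, with function names chosen just for this example:

```c
#include <assert.h>
#include <stddef.h>

#define TPKT_VERSION     3
#define TPKT_HEADER_SIZE 4

/* Fill a 4-byte TPKT header for a payload; returns -1 if it does not fit. */
int tpkt_write_header(unsigned char *hdr, size_t payload_len)
{
    size_t total = payload_len + TPKT_HEADER_SIZE;
    assert(hdr != NULL);
    if (total > 0xFFFF)
        return -1;                          /* length field is only 16 bits */
    hdr[0] = TPKT_VERSION;
    hdr[1] = 0;                             /* reserved */
    hdr[2] = (unsigned char)(total >> 8);   /* big-endian length, header included */
    hdr[3] = (unsigned char)(total & 0xFF);
    return 0;
}

/* Return the payload length announced by the header, or -1 if it is invalid. */
long tpkt_payload_length(const unsigned char *hdr)
{
    unsigned total;
    assert(hdr != NULL);
    if (hdr[0] != TPKT_VERSION)
        return -1;
    total = ((unsigned)hdr[2] << 8) | hdr[3];
    if (total < TPKT_HEADER_SIZE)
        return -1;
    return (long)(total - TPKT_HEADER_SIZE);
}
```

It shows the same pattern on a small scale: a version byte first, then a length that covers the header, and everything else left to the payload.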
