Parsing a binary file. What is a modern way?

悲哀的现实 2021-01-30 01:41

I have a binary file whose layout I know. For example, suppose the format is:

  • 2 bytes (unsigned short) - length of a string
  • 5 bytes (5 x chars) - the
10 Answers
  • 2021-01-30 02:11

    Use a serialization library. Here are a few:

    • Boost serialization and Boost fusion
    • Cereal (my own library)
    • Another library called cereal (same name as mine but mine predates theirs)
    • Cap'n Proto
  • 2021-01-30 02:17

    I use the Ragel tool to generate pure C procedural source code (no tables) for microcontrollers with 1-2K of RAM. The generated code does not use any file I/O or buffering, and Ragel produces both easy-to-debug code and a .dot/.pdf file with the state machine diagram.

    Ragel can also output Go, Java, and other code for parsing, but I have not used these features.

    The key feature of Ragel is its ability to parse any byte-oriented data, but you can't dig into bit fields. The other problem is that Ragel can parse regular structures but has no support for recursion or syntax grammar parsing.

  • 2021-01-30 02:24

    I personally do it this way:

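    // some code which loads the file into memory (file_in_memory points at the buffer)
    #pragma pack(push, 1)   // no padding, so the struct matches the on-disk layout
    struct someFile { int a, b, c; char d[0xEF]; };
    #pragma pack(pop)
    
    // overlay the struct on the first sizeof(someFile) bytes of the buffer
    someFile* f = reinterpret_cast<someFile*>(file_in_memory);
    int filePropertyA = f->a;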
    

    This is a very effective way to read fixed-size structs at the start of the file.

  • 2021-01-30 02:28

    Currently I do it like this:

    • load the file into an ifstream

    • read from this stream into char buffer[2]

    • cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) };. Now I have the length of the string.

    That last step risks a SIGBUS (if your character array happens to start at an odd address and your CPU can only read 16-bit values aligned at an even address), poor performance (some CPUs will read misaligned values, but more slowly; others, like modern x86s, are fine and fast), and endianness issues. I'd suggest reading the two bytes into an unsigned char array x and then combining them as (x[0] << 8) | x[1] or vice versa, using ntohs/htons where you need to correct for endianness.

    • read from the stream into a vector<char> and create a std::string from this vector. Now I have the string id.

    No need... just read directly into the string:

    std::string s(the_size, ' ');
    
    if (input_stream.read(&s[0], s.size()) &&
        input_stream.gcount() == static_cast<std::streamsize>(s.size()))
        ...use s...
    
    • read the next 4 bytes the same way and cast them to unsigned int. Now I have a stride. While not at end of file, read floats the same way: create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.

    It is better to read the data directly into the unsigned ints and floats, as that way the compiler will ensure correct alignment.

    This works, but to me it looks ugly. Can I read directly into an unsigned short, float, string, etc. without creating a char [x]? If not, what is the correct way to cast (I read that the style I'm using is an old style)?

    struct Data
    {
        uint32_t x;
        float y[6];
    };
    Data data;
    if (input_stream.read((char*)&data, sizeof data) &&
        input_stream.gcount() == sizeof data)
        ...use x and y...
    

    Note that the code above avoids reading data into potentially unaligned character arrays; it is unsafe to reinterpret_cast data held in a potentially unaligned char array (including inside a std::string) because of alignment issues. Again, you may need some post-read conversion with ntohl if there's a chance the file content differs in endianness. If there's an unknown number of floats, you'll need to calculate and allocate sufficient storage with alignment of at least 4 bytes, then aim a Data* at it. Indexing past the declared array size of y then works in practice as long as the memory at the accessed addresses was part of the allocation and holds a valid float representation read in from the stream, although the C++ standard does not formally guarantee it. A simpler option, with an additional read and so possibly slower: read the uint32_t first, then new float[n] and do a further read into there.

    Practically, this type of approach can work, and a lot of low-level and C code does exactly this. "Cleaner" high-level libraries that might help you read the file must ultimately be doing something similar internally.
