Parsing a binary file. What is a modern way?

悲哀的现实 2021-01-30 01:41

I have a binary file with some layout I know. For example let format be like this:

  • 2 bytes (unsigned short) - length of a string
  • 5 bytes (5 x chars) - the
10 Answers
  • 2021-01-30 02:03

    I actually implemented a quick and dirty binary format parser to read .zip files (following Wikipedia's format description) just last month, and being modern I decided to use C++ templates.

    On some specific platforms a packed struct could work; however, there are things it does not handle well, such as fields of variable length. With templates there is no such issue: you can get arbitrarily complex structures (and return types).

    A .zip archive is relatively simple, fortunately, so I implemented something simple. Off the top of my head:

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <utility>
    
    // A buffer is a pointer to the first byte plus the number of bytes.
    using Buffer = std::pair<unsigned char const*, size_t>;
    
    template <typename OffsetReader>
    class UInt16LEReader: private OffsetReader {
    public:
        UInt16LEReader() {}
        explicit UInt16LEReader(OffsetReader const& reader): OffsetReader(reader) {}
    
        uint16_t read(Buffer const& buffer) const {
            OffsetReader const& reader = *this;
    
            size_t const offset = reader.read(buffer);
            assert(offset <= buffer.second && "Incorrect offset");
            assert(offset + 2 <= buffer.second && "Too short buffer");
    
            unsigned char const* begin = buffer.first + offset;
    
            // Assemble the value byte by byte, independent of host endianness:
            // http://commandcenter.blogspot.fr/2012/04/byte-order-fallacy.html
            return (uint16_t(begin[0]) << 0)
                 + (uint16_t(begin[1]) << 8);
        }
    }; // class UInt16LEReader
    
    // The same pattern is repeated for UInt[8|16|32][LE|BE]...
    

    Of course, the basic OffsetReader actually has a constant result:

    template <size_t O>
    class FixedOffsetReader {
    public:
        size_t read(Buffer const&) const { return O; }
    }; // class FixedOffsetReader
    

    Since these are templates, you can swap the reader types at will (for example, you could implement a proxy reader which delegates all reads to a shared_ptr which memoizes them).
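
    To make the swapping of offset readers concrete, here is a minimal sketch. `RuntimeOffsetReader`, `kSample` and `kSampleBuffer` are hypothetical names that do not appear in the answer; the other two classes follow the code above, with the constructor parameter renamed because `or` is a reserved token in C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

using Buffer = std::pair<unsigned char const*, std::size_t>;

// Offset known at compile time, as in the answer.
template <std::size_t O>
class FixedOffsetReader {
public:
    std::size_t read(Buffer const&) const { return O; }
};

// Hypothetical alternative: the offset is chosen at run time.
class RuntimeOffsetReader {
public:
    explicit RuntimeOffsetReader(std::size_t offset = 0): offset_(offset) {}
    std::size_t read(Buffer const&) const { return offset_; }
private:
    std::size_t offset_;
};

// Reads a little-endian 16-bit value at whatever offset the base class reports.
template <typename OffsetReader>
class UInt16LEReader: private OffsetReader {
public:
    UInt16LEReader() {}
    explicit UInt16LEReader(OffsetReader const& reader): OffsetReader(reader) {}

    std::uint16_t read(Buffer const& buffer) const {
        OffsetReader const& reader = *this;
        std::size_t const offset = reader.read(buffer);
        assert(offset <= buffer.second && "Incorrect offset");
        assert(offset + 2 <= buffer.second && "Too short buffer");
        unsigned char const* begin = buffer.first + offset;
        return static_cast<std::uint16_t>(begin[0] | (begin[1] << 8));
    }
};

// First six bytes of a ZIP local file header: the signature, then version 20.
unsigned char const kSample[] = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
Buffer const kSampleBuffer(kSample, sizeof kSample);
```

    With these definitions, `UInt16LEReader<FixedOffsetReader<4>>().read(kSampleBuffer)` and `UInt16LEReader<RuntimeOffsetReader>(RuntimeOffsetReader(4)).read(kSampleBuffer)` read the same two bytes; only the way the offset is supplied differs.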

    What is interesting, though, is the end result:

    // http://en.wikipedia.org/wiki/Zip_%28file_format%29#File_headers
    class LocalFileHeader {
    public:
        template <size_t O>
        using UInt32 = UInt32LEReader<FixedOffsetReader<O>>;
        template <size_t O>
        using UInt16 = UInt16LEReader<FixedOffsetReader<O>>;
    
        UInt32< 0> signature;
        UInt16< 4> versionNeededToExtract;
        UInt16< 6> generalPurposeBitFlag;
        UInt16< 8> compressionMethod;
        UInt16<10> fileLastModificationTime;
        UInt16<12> fileLastModificationDate;
        UInt32<14> crc32;
        UInt32<18> compressedSize;
        UInt32<22> uncompressedSize;
    
        using FileNameLength = UInt16<26>;
        using ExtraFieldLength = UInt16<28>;
    
        using FileName = StringReader<FixedOffsetReader<30>, FileNameLength>;
    
        using ExtraField = StringReader<
            CombinedAdd<FixedOffsetReader<30>, FileNameLength>,
            ExtraFieldLength
        >;
    
        FileName filename;
        ExtraField extraField;
    }; // class LocalFileHeader
    

    This is rather simplistic, obviously, but incredibly flexible at the same time.

    An obvious axis of improvement would be to improve chaining since here there is a risk of accidental overlaps. My archive reading code worked the first time I tried it though, which was evidence enough for me that this code was sufficient for the task at hand.
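
    The answer does not show `StringReader` or `CombinedAdd`, so the following is only a guess at how they might look. It assumes stateless, default-constructible readers and uses composition rather than the private inheritance shown earlier; `makeHeader` is a hypothetical helper that fabricates test input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using Buffer = std::pair<unsigned char const*, std::size_t>;

template <std::size_t O>
struct FixedOffsetReader {
    std::size_t read(Buffer const&) const { return O; }
};

template <typename OffsetReader>
struct UInt16LEReader {
    std::uint16_t read(Buffer const& buffer) const {
        std::size_t const offset = OffsetReader().read(buffer);
        if (offset > buffer.second || buffer.second - offset < 2)
            throw std::out_of_range("UInt16LEReader: buffer too short");
        unsigned char const* p = buffer.first + offset;
        return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
    }
};

// Assumed helper: the sum of two readers' results, used as an offset.
template <typename Lhs, typename Rhs>
struct CombinedAdd {
    std::size_t read(Buffer const& buffer) const {
        return std::size_t(Lhs().read(buffer)) + std::size_t(Rhs().read(buffer));
    }
};

// Assumed helper: reads `length` bytes starting at `offset` as a string.
template <typename OffsetReader, typename LengthReader>
struct StringReader {
    std::string read(Buffer const& buffer) const {
        std::size_t const offset = OffsetReader().read(buffer);
        std::size_t const length = LengthReader().read(buffer);
        if (offset > buffer.second || length > buffer.second - offset)
            throw std::out_of_range("StringReader: field exceeds buffer");
        return std::string(reinterpret_cast<char const*>(buffer.first) + offset, length);
    }
};

using FileNameLength = UInt16LEReader<FixedOffsetReader<26>>;
using ExtraFieldLength = UInt16LEReader<FixedOffsetReader<28>>;
using FileName = StringReader<FixedOffsetReader<30>, FileNameLength>;
using ExtraField = StringReader<CombinedAdd<FixedOffsetReader<30>, FileNameLength>,
                                ExtraFieldLength>;

// Fabricates a header: 30 fixed bytes, then the name, then the extra field.
std::vector<unsigned char> makeHeader(std::string const& name, std::string const& extra) {
    std::vector<unsigned char> bytes(30, 0);
    bytes[26] = static_cast<unsigned char>(name.size() & 0xFF);
    bytes[27] = static_cast<unsigned char>((name.size() >> 8) & 0xFF);
    bytes[28] = static_cast<unsigned char>(extra.size() & 0xFF);
    bytes[29] = static_cast<unsigned char>((extra.size() >> 8) & 0xFF);
    bytes.insert(bytes.end(), name.begin(), name.end());
    bytes.insert(bytes.end(), extra.begin(), extra.end());
    return bytes;
}
```

    The extra field's offset is itself computed from another field (30 plus the file name length), which is the kind of dependent layout a packed struct cannot express.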

  • 2021-01-30 02:04

    Since all of your data is variable, you can read the two blocks separately and still use casting:

    struct id_contents
    {
        uint16_t len;
        char id[];   // flexible array member: standard in C99, a compiler extension in C++
    } __attribute__((packed)); // assuming gcc, ymmv
    
    struct data_contents
    {
        uint32_t stride;
        float data[];   // flexible array member, as above
    } __attribute__((packed)); // assuming gcc, ymmv
    
    class my_row
    {
        const id_contents* id_;
        const data_contents* data_;
        size_t size_;   // total number of bytes this row occupies in the buffer
    
    public:
        my_row(const char* buffer) {
            // The id block sits at the start of the row
            id_ = reinterpret_cast<const id_contents*>(buffer);
            size_ = sizeof(*id_) + id_->len;
            // The data block follows immediately after the id string
            data_ = reinterpret_cast<const data_contents*>(buffer + size_);
            size_ += sizeof(*data_) + 
                data_->stride * sizeof(float); // or however many, 3*float?
        }
    
        size_t size() const { return size_; }
    };
    

    That way you can use Mr. kbok's answer to parse correctly:

    const char* buffer = getPointerToDataSomehow();
    
    my_row data1(buffer);
    buffer += data1.size();
    
    my_row data2(buffer);
    buffer += data2.size();
    
    // etc.
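
    For comparison, the same row layout can be parsed without packed structs or flexible array members by copying each field out with std::memcpy. This is a sketch under stated assumptions, not the answer's method: `Row`, `take`, `parseRow` and `appendRow` are hypothetical names, the file is assumed to use the host's byte order, and `stride` is assumed to be the number of floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

struct Row {
    std::string id;
    std::vector<float> data;
};

// Copies one trivially copyable value out of the buffer and advances the cursor.
template <typename T>
T take(char const*& cursor, char const* end) {
    if (static_cast<std::size_t>(end - cursor) < sizeof(T))
        throw std::runtime_error("truncated input");
    T value;
    std::memcpy(&value, cursor, sizeof(T));
    cursor += sizeof(T);
    return value;
}

// Parses: uint16 length, `length` chars, uint32 count, `count` floats.
Row parseRow(char const*& cursor, char const* end) {
    Row row;
    std::uint16_t const len = take<std::uint16_t>(cursor, end);
    if (static_cast<std::size_t>(end - cursor) < len)
        throw std::runtime_error("truncated id");
    row.id.assign(cursor, len);
    cursor += len;

    std::uint32_t const count = take<std::uint32_t>(cursor, end);
    if (static_cast<std::size_t>(end - cursor) / sizeof(float) < count)
        throw std::runtime_error("truncated data");
    row.data.resize(count);
    if (count != 0)
        std::memcpy(row.data.data(), cursor, count * sizeof(float));
    cursor += count * sizeof(float);
    return row;
}

// Helper for trying it out: appends one row in the same layout, host byte order.
void appendRow(std::vector<char>& out, std::string const& id,
               std::vector<float> const& data) {
    std::uint16_t const len = static_cast<std::uint16_t>(id.size());
    std::uint32_t const count = static_cast<std::uint32_t>(data.size());
    char const* p = reinterpret_cast<char const*>(&len);
    out.insert(out.end(), p, p + sizeof len);
    out.insert(out.end(), id.begin(), id.end());
    p = reinterpret_cast<char const*>(&count);
    out.insert(out.end(), p, p + sizeof count);
    p = reinterpret_cast<char const*>(data.data());
    out.insert(out.end(), p, p + data.size() * sizeof(float));
}
```

    Calling `parseRow` repeatedly with the same cursor walks the buffer row by row, much like the `buffer += data1.size()` loop above, but each row owns its data and truncated input is reported instead of read past.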
    
  • 2021-01-30 02:06

    Unless this is for learning purposes, and if you are free to choose the binary format, consider using something like protobuf, which handles the serialization for you and lets you interoperate with other platforms and languages.

    If you cannot use a third-party API, you may look at QDataStream for inspiration.

    • Documentation
    • Source code
  • 2021-01-30 02:10

    The C way, which would work fine in C++, would be to declare a struct:

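    #pragma pack(push, 1)   // pack members with no padding from here on
    
    struct contents {
       // data members;
    };
    
    #pragma pack(pop)       // restore the previous packing for later declarations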
    

    Note that

    • You need a pragma to stop the compiler from inserting padding, so that the in-memory layout matches the struct exactly as written;
    • This technique only works with POD types.

    And then cast the read buffer directly into the struct type:

    std::vector<char> buf(sizeof(contents));
    file.read(buf.data(), buf.size());
    contents *stuff = reinterpret_cast<contents *>(buf.data());
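
    As a variation on the cast, the bytes can be read straight into an object of the struct type, which avoids pointing a `contents *` at a char buffer. The sketch below is illustrative only: the two fields and the `read_contents` function are hypothetical, and it assumes a compiler that supports `#pragma pack(push, 1)` (gcc, clang and MSVC do).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <type_traits>

#pragma pack(push, 1)   // no padding between members from here...
struct contents {
    std::uint16_t kind;   // hypothetical fields, for illustration only
    std::uint32_t value;
};
#pragma pack(pop)       // ...to here; later declarations use default alignment

// Without the pragma, most compilers would pad this struct to 8 bytes.
static_assert(sizeof(contents) == 6, "packed layout: 2 + 4 bytes, no padding");
static_assert(std::is_trivially_copyable<contents>::value,
              "required for reading raw bytes into the object");

// Reads one record straight into the object; returns false on short input.
bool read_contents(std::istream& file, contents& out) {
    file.read(reinterpret_cast<char*>(&out), sizeof out);
    return file.gcount() == static_cast<std::streamsize>(sizeof out);
}
```

    The static_asserts turn a silent layout mismatch into a compile error, which is useful when the same header is built with a different compiler.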
    

    Now, if your data's size is variable, you can split it into several chunks. To read a single binary object from the buffer, a reader function comes in handy:

    template<typename T>
    const char *read_object(const char *buffer, T& target) {
        target = *reinterpret_cast<const T*>(buffer);
        return buffer + sizeof(T);
    }
    

    The main advantage is that such a reader can be overloaded for more advanced C++ objects:

    template<typename CT>
    const char *read_object(const char *buffer, std::vector<CT>& target) {
        size_t size = target.size();
        CT const *buf_start = reinterpret_cast<const CT*>(buffer);
        std::copy(buf_start, buf_start + size, target.begin());
        return buffer + size * sizeof(CT);
    }
    

    And now in your main parser:

    // iter is a const char* pointing at the current read position in the buffer
    int n_floats;
    iter = read_object(iter, n_floats);
    std::vector<float> my_floats(n_floats);
    iter = read_object(iter, my_floats);
    

    Note: As Tony D observed, even if you can get the alignment right via #pragma directives and manual padding (if needed), you may still encounter incompatibility with your processor's alignment, in the form of (best case) performance issues or (worst case) trap signals. This method is probably interesting only if you have control over the file's format.
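
    The alignment concern in the note can be sidestepped by copying bytes instead of dereferencing a cast pointer, since std::memcpy places no alignment requirement on its source. The following is a sketch of `read_object` rewritten that way; the static_asserts are an addition, and the file is still assumed to use the host's byte order and type sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Copies one trivially copyable object out of the buffer. The source address
// does not need to be suitably aligned for T.
template <typename T>
const char* read_object(const char* buffer, T& target) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "T must be trivially copyable");
    std::memcpy(&target, buffer, sizeof(T));
    return buffer + sizeof(T);
}

// Overload for vectors: fills the elements the vector already holds,
// so the caller sizes the vector before calling.
template <typename CT>
const char* read_object(const char* buffer, std::vector<CT>& target) {
    static_assert(std::is_trivially_copyable<CT>::value,
                  "CT must be trivially copyable");
    std::size_t const bytes = target.size() * sizeof(CT);
    if (bytes != 0)
        std::memcpy(target.data(), buffer, bytes);
    return buffer + bytes;
}
```

    The calling code from the main parser above works unchanged with these versions, even when `iter` points at an odd address.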

  • 2021-01-30 02:11

    I had to solve this problem once. The data files were packed FORTRAN output. Alignments were all wrong. I succeeded with preprocessor tricks that did automatically what you are doing manually: unpack the raw data from a byte buffer to a struct. The idea is to describe the data in an include file:

    BEGIN_STRUCT(foo)
        UNSIGNED_SHORT(length)
        STRING_FIELD(length, label)
        UNSIGNED_INT(stride)
        FLOAT_ARRAY(3 * stride)
    END_STRUCT(foo)
    

    Now you can define these macros to generate the code you need, say the struct declaration, include the above, undef and define the macros again to generate unpacking functions, followed by another include, etc.
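
    A single-file variation of this idea is the "X-macro": instead of re-including a description file, the field list is itself a macro that takes another macro as its argument. The sketch below is an assumption-laden illustration, not the original code: `FOO_FIELDS`, `unpack_foo` and `foo_packed_size` are hypothetical names, only fixed-size fields are handled, and the data is assumed to be in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// One description of the layout; X(type, name) is expanded differently below.
// Variable-length fields would need additional macros.
#define FOO_FIELDS(X)        \
    X(std::uint16_t, length) \
    X(std::uint32_t, stride) \
    X(float, scale)

// Expansion 1: the struct declaration.
#define DECLARE_MEMBER(type, name) type name;
struct foo {
    FOO_FIELDS(DECLARE_MEMBER)
};
#undef DECLARE_MEMBER

// Expansion 2: the unpacking function, reading the fields back to back.
#define UNPACK_MEMBER(type, name)                 \
    std::memcpy(&out.name, cursor, sizeof(type)); \
    cursor += sizeof(type);
inline const char* unpack_foo(const char* cursor, foo& out) {
    FOO_FIELDS(UNPACK_MEMBER)
    return cursor;
}
#undef UNPACK_MEMBER

// Expansion 3: the packed size in the file, which can differ from sizeof(foo).
#define ADD_SIZE(type, name) + sizeof(type)
constexpr std::size_t foo_packed_size = 0 FOO_FIELDS(ADD_SIZE);
#undef ADD_SIZE
```

    Adding a field to `FOO_FIELDS` updates the struct, the unpacker and the size together, which is the point of generating the code from one description.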

    NB I first saw this technique used in gcc for abstract syntax tree-related code generation.

    If CPP is not powerful enough (or such preprocessor abuse is not for you), substitute a small lex/yacc program (or pick your favorite tool).

    It's amazing to me how often it pays to think in terms of generating code rather than writing it by hand, at least in low level foundation code like this.

  • 2021-01-30 02:11

    It is better to declare a structure (with 1-byte packing; how to request it depends on the compiler). Write using that structure, and read using the same structure. Put only POD types in the structure, hence no std::string etc. Use this structure only for file I/O or other inter-process communication; use a normal struct or class to hold the data for further use in the C++ program.
