I have a binary file whose layout I know. For example, let the format be like this:
Use a serialization library. Here are a few:
I use the ragel tool to generate pure C procedural source code (no tables) for microcontrollers with 1-2K of RAM. The generated code does not use any file I/O or buffering, and ragel produces both easy-to-debug code and a .dot/.pdf file with the state machine diagram.
ragel can also output Go, Java, and other code for parsing, but I have not used these features.
The key feature of ragel is the ability to parse any byte-oriented data, but you can't dig into bit fields. Another limitation is that ragel can parse regular structures but has no support for recursion or grammar-based syntax parsing.
I personally do it this way:
// some code which loads the file in memory
#pragma pack(push, 1)
struct someFile { int a, b, c; char d[0xEF]; };
#pragma pack(pop)
someFile* f = (someFile*) (file_in_memory);
int filePropertyA = f->a;
This is a very effective way to handle fixed-size structs at the start of the file.
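A variation on the same idea, sketched here under a few assumptions, copies the header bytes into a struct with std::memcpy instead of casting the buffer pointer, which sidesteps alignment questions when the buffer's address is arbitrary. The names SomeFileHeader, load_header and demo_load_header are hypothetical, the file is assumed to be loaded into a std::vector<char> already, and the fields are assumed to be stored in the host's byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Packed so the struct matches the on-disk layout byte for byte.
#pragma pack(push, 1)
struct SomeFileHeader {
    std::int32_t a, b, c;
    char d[0xEF];
};
#pragma pack(pop)

static_assert(sizeof(SomeFileHeader) == 12 + 0xEF, "unexpected padding in header");

// Copy the leading bytes of the in-memory file into a header object.
// Returns false if the buffer is too short to contain a full header.
inline bool load_header(const std::vector<char>& file, SomeFileHeader& out) {
    if (file.size() < sizeof out) return false;
    std::memcpy(&out, file.data(), sizeof out);
    return true;
}

// Small self-check: build a fake file image and read the header back.
inline void demo_load_header() {
    SomeFileHeader src{};
    src.a = 1; src.b = 2; src.c = 3;
    src.d[0] = 'x';
    std::vector<char> file(sizeof src + 16);  // header followed by other data
    std::memcpy(file.data(), &src, sizeof src);

    SomeFileHeader h{};
    bool ok = load_header(file, h);
    assert(ok);
    assert(h.a == 1 && h.b == 2 && h.c == 3 && h.d[0] == 'x');
    (void)ok;
}
```

The copy costs a few hundred bytes of memcpy, which is usually negligible next to the file read itself.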
Currently I do it so:
- load file to ifstream
- read this stream to char buffer[2]
- cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) }; Now I have the length of a string.
That last step risks a SIGBUS (if your character array happens to start at an odd address and your CPU can only read 16-bit values that are aligned at an even address), poor performance (some CPUs will read misaligned values but more slowly; others, like modern x86s, are fine and fast), and/or endianness issues. I'd suggest reading the two characters and then combining them as (x[0] << 8) | x[1] or vice versa (with x holding unsigned char values, so that no sign extension creeps in), using htons if you need to correct for endianness.
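The byte-combining suggestion can be sketched roughly as follows. The helper names load_le16, load_be16, read_le16 and demo_le16 are made up for illustration, and the choice of little-endian for the length prefix is an assumption about the file format, not something the question states.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Combine two bytes stored least-significant-first into a 16-bit value.
inline std::uint16_t load_le16(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// Combine two bytes stored most-significant-first into a 16-bit value.
inline std::uint16_t load_be16(const unsigned char* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Read a 16-bit little-endian value from a stream.
// Returns false if fewer than two bytes were available.
inline bool read_le16(std::istream& in, std::uint16_t& out) {
    unsigned char bytes[2];
    if (!in.read(reinterpret_cast<char*>(bytes), sizeof bytes)) return false;
    out = load_le16(bytes);
    return true;
}

// Small self-check using an in-memory stream instead of a real file.
inline void demo_le16() {
    std::istringstream in(std::string("\x05\x00", 2));
    std::uint16_t len = 0;
    bool ok = read_le16(in, len);
    assert(ok && len == 5);
    (void)ok;
}
```

Because the value is assembled with shifts, the result is the same regardless of the host's endianness or the buffer's alignment.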
- read a stream to vector<char> and create a std::string from this vector. Now I have string id.
No need... just read directly into the string:
std::string s(the_size, ' ');
if (input_stream.read(&s[0], s.size()) &&
    input_stream.gcount() == s.size())
    ...use s...
- the same way read next 4 bytes and cast them to unsigned int. Now I have a stride.
- while not end of file, read floats the same way - create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.
Better to read the data directly over the unsigned ints and floats, as that way the compiler will ensure correct alignment.
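One way to read "directly over" a typed variable is a small helper template, sketched below. The names read_value and demo_read_value are hypothetical, and the sketch assumes the file was written in the host's byte order with the same sizes for uint32_t and float.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <type_traits>

// Read sizeof(T) bytes from the stream straight into a T object.
// The destination is a real T, so the compiler guarantees its alignment.
// Returns false on a short read.
template <typename T>
bool read_value(std::istream& in, T& out) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "raw byte reads only make sense for trivially copyable types");
    return static_cast<bool>(in.read(reinterpret_cast<char*>(&out), sizeof out));
}

// Small self-check: serialize a stride and a float, then read them back.
inline void demo_read_value() {
    const std::uint32_t stride_in = 6;
    const float value_in = 1.5f;
    std::string bytes(reinterpret_cast<const char*>(&stride_in), sizeof stride_in);
    bytes.append(reinterpret_cast<const char*>(&value_in), sizeof value_in);

    std::istringstream in(bytes);
    std::uint32_t stride = 0;
    float value = 0.0f;
    bool ok = read_value(in, stride) && read_value(in, value);
    assert(ok && stride == 6 && value == 1.5f);
    (void)ok;
}
```

No intermediate char array or pointer cast of the data is needed; only the address of the destination is reinterpreted as char*, which is always permitted.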
This works, but to me it looks ugly. Can I read directly into unsigned short or float or string etc. without creating a char [x]? If not, what is the way to cast correctly (I read that the style I'm using is an old style)?
struct Data
{
    uint32_t x;
    float y[6];
};
Data data;
if (input_stream.read((char*)&data, sizeof data) &&
    input_stream.gcount() == sizeof data)
    ...use x and y...
Note the code above avoids reading data into potentially unaligned character arrays: it's unsafe to reinterpret_cast data in a potentially unaligned char array (including inside a std::string) due to alignment issues. Again, you may need some post-read conversion with htonl if there's a chance the file content differs in endianness. If there's an unknown number of floats, you'll need to calculate and allocate sufficient storage with alignment of at least 4 bytes, then aim a Data* at it... indexing past the declared array size of y works in practice as long as the memory content at the accessed addresses was part of the allocation and holds a valid float representation read in from the stream, although the C++ standard does not strictly guarantee it. Simpler - but with an additional read, so possibly slower - read the uint32_t first, then new float[n], and do a further read into there....
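The simpler two-read variant might look like the sketch below, which swaps the raw new float[n] for a std::vector<float> so the storage is released automatically. It assumes, purely for illustration, that the leading uint32_t holds the number of floats that follow and that the file uses the host's byte order; read_counted_floats and demo_read_floats are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// First read: the element count. Second read: all floats in one call,
// directly into float storage that the vector has aligned correctly.
// Returns false if the stream ends before all the data is read.
inline bool read_counted_floats(std::istream& in, std::vector<float>& out) {
    std::uint32_t n = 0;
    if (!in.read(reinterpret_cast<char*>(&n), sizeof n)) return false;

    std::vector<float> values(n);
    if (n != 0 &&
        !in.read(reinterpret_cast<char*>(values.data()),
                 static_cast<std::streamsize>(n * sizeof(float)))) {
        return false;
    }
    out.swap(values);
    return true;
}

// Small self-check with an in-memory stream standing in for the file.
inline void demo_read_floats() {
    const std::uint32_t n = 3;
    const float src[3] = {1.0f, 2.5f, -4.0f};
    std::string bytes(reinterpret_cast<const char*>(&n), sizeof n);
    bytes.append(reinterpret_cast<const char*>(src), sizeof src);

    std::istringstream in(bytes);
    std::vector<float> values;
    bool ok = read_counted_floats(in, values);
    assert(ok && values.size() == 3 && values[1] == 2.5f);
    (void)ok;
}
```

For untrusted files it would be worth checking n against the remaining file size before allocating, since a corrupt count could otherwise request a very large buffer.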
In practice, this type of approach can work, and a lot of low-level and C code does exactly this. "Cleaner" high-level libraries that might help you read the file must ultimately be doing something similar internally....