Often I find myself having to represent a structure that consists of very small values. For example, Foo
has 4 values, a, b, c, d
that, range from
Let's say, you have a memory bus that's a little bit older and can deliver 10 GB/s. Now take a CPU at 2.5 GHz, and you see that you would need to handle at least four bytes per cycle to saturate the memory bus. As such, when you use the definition of
struct Foo {
char a;
char b;
char c;
char d;
}
and use all four variables in each pass through the data, your code will be CPU bound. You can't gain any speed by a denser packing.
Now, this is different when each pass only performs a trivial operation on one of the four values. In that case, you are better off with a struct of arrays:
struct Foo {
size_t count;
char* a; //a[count]
char* b; //b[count]
char* c; //c[count]
char* d; //d[count]
}
You've stated the common and ambiguous C/C++ tag.
Assuming C++, make the data private and add getters/ setters. No, that will not cause a performance hit - providing the optimizer is turned on.
You can then change the implementation to use the alternatives without any change to your calling code - and therefore more easily finesse the implementation based on the results of the bench tests.
For the record, I'd expect the struct
with bit fields as per @dbush to be most likely the fastest given your description.
Note all this is around keeping the data in cache - you may also want to see if the design of the calling algorithm can help with that.
There is no definitive answer, and you haven't given enough information to allow a "right" choice to be made. There are trade-offs.
Your statement that your "primary goal is time efficiency" is insufficient, since you haven't specified whether I/O time (e.g. to read data from file) is more of a concern than computational efficiency (e.g. how long some set of computations take after a user hits a "Go" button).
So it might be appropriate to write the data as a single char (to reduce time to read or write) but unpack it into an array of four int
(so subsequent calculations go faster).
Also, there is no guarantee that an int
is 32 bits (which you have assumed in your statement that the first packing uses 128 bits). An int
can be 16 bits.
Fitting your data set in cache is critical. Smaller is always better, because hyperthreading competitively shares the per-core caches between the hardware threads (on Intel CPUs). Comments on this answer include some numbers for costs of cache misses.
On x86, loading 8bit values with sign or zero-extension into 32 or 64bit registers (movzx
or movsx
) is literally just as fast as plain mov
of a byte or 32bit dword. Storing the low byte of a 32bit register also has no overhead. (See Agner Fog's instruction tables and C / asm optimization guides here).
Still x86-specific: [u]int8_t
temporaries are ok, too, but avoid [u]int16_t
temporaries. (load/store from/to [u]int16_t
in memory is fine, but working with 16bit values in registers has big penalties from the operand-size prefix decoding slowly on Intel CPUs.) 32bit temporaries will be faster if you want to use them as an array index. (Using 8bit registers doesn't zero the high 24/56bits, so it takes an extra instruction to zero or sign-extend, to use an 8bit register as an array index, or in an expression with a wider type (like adding it to an int
.)
I'm unsure what ARM or other architectures can do as far as efficient zero/sign extension from single-byte loads, or for single-byte stores.
Given this, my recommendation is pack for storage, use int
for temporaries. (Or long
, but that will increase code size slightly on x86-64, because a REX
prefix is needed to specify a 64bit operand size.) e.g.
int a_i = foo[i].a;
int b_i = foo[i].b;
...;
foo[i].a = a_i + b_i;
Packing into bitfields will have more overhead, but can still be worth it. Testing a compile-time-constant-bit-position (or multiple bits) in a byte or 32/64bit chunk of memory is fast. If you actually need to unpack some bitfields into int
s and pass them to a non-inline function call or something, that will take a couple extra instructions to shift and mask. If this gives even a small reduction in cache misses, this can be worth it.
Testing, setting (to 1) or clearing (to 0) a bit or group of bits can be done efficiently with OR
or AND
, but assigning an unknown boolean value to a bitfield takes more instructions to merge the new bits with the bits for other fields. This can significantly bloat code if you assign a variable to a bitfield very often. So using int foo:6
and things like that in your structs, because you know foo
doesn't need the top two bits, is not likely to be helpful. If you're not saving many bits compared to putting each thing in it's own byte/short/int, then the reduction in cache misses won't outweigh the extra instructions (which can add up into I-cache / uop-cache misses, as well as the direct extra latency and work of the instructions.)
The x86 BMI1 / BMI2 (Bit-Manipulation) instruction-set extensions will make copying data from a register into some destination bits (without clobbering the surrounding bits) more efficient. BMI1: Haswell, Piledriver. BMI2: Haswell, Excavator(unreleased). Note that like SSE/AVX, this will mean you'd need BMI versions of your functions, and fallback non-BMI versions for CPUs that don't support those instructions. AFAIK, compilers don't have options to see patterns for these instructions and use them automatically. They're only usable via intrinsics (or asm).
Dbush's answer, packing into bitfields is probably a good choice, depending on how you use your fields. Your fourth option (of packing four separate abcd
values into one struct) is probably a mistake, unless you can do something useful with four sequential abcd
values (vector-style).
For a data structure your code uses extensively, it makes sense to set things up so you can flip from one implementation to another, and benchmark. Nir Friedman's answer, with getters/setters is a good way to go. However, just using int
temporaries and working with the fields as separate members of the struct should work fine. It's up to the compiler to generate code to test the right bits of a byte, for packed bitfields.
If you have any code that checks just one or a couple fields of each struct, esp. looping over sequential struct values, then the struct-of-arrays answer given by cmaster will be useful. x86 vector instructions have a single byte as the smallest granularity, so a struct-of-arrays with each value in a separate byte would let you quickly scan for the first element where a == something
, using PCMPEQB / PTEST
.
Code it with int
s
treat the fields as int
s.
blah.x
in all your code, except the declarion will be all you will be doing. Integral promotion will take care of most cases.
When you are all done, have 3 equivalant include files: an include file using int
s, one using char
and one using bitfields.
And then profile. Don't worry about it at this stage, because its premature optimization, and nothing but your chosen include file will change.
Getting back to the question asked :
used in a tight loop;
their values are read a billion times/s, and that is the bottleneck of the program;
the whole program consists of a big array of billions of Foos;
This is a classic example of when you should write platform specific high performance code that takes time to design for each implementation platform, but the benefits outweigh that cost.
As it's the bottleneck of the entire program you don't look for a general solution, but recognize that this needs to have multiple approaches tested and timed against real data, as the best solution will be platform specific.
It is also possible, as it is a large array of billion of foos, that the OP should consider using OpenCL or OpenMP as potential solutions so as to maximize the exploitation of available resources on the runtime hardware. This is a little dependent on what you need from the data, but it's probably the most important aspect of this type of problem - how to exploit available parallelism.
But there is no single right answer to this question, IMO.