Often I find myself having to represent a structure that consists of very small values. For example, Foo has 4 values, a, b, c, d that range from
First, precisely define what you mean by "most efficient". Best memory utilization? Best performance?
Then implement your algorithm both ways and actually profile it on the actual hardware you intend to run it on under the actual conditions you intend to run it under once it's delivered.
Pick the one that better meets your original definition of "most efficient".
Anything else is just a guess. Whatever you choose will probably work fine, but without actually measuring the difference under the exact conditions you'd use the software, you'll never know which implementation would be "more efficient".
- the whole program consists of a big array of billions of Foos;
First things first, for #2, you or your users (if others run the software) might often be unable to allocate this array successfully if it spans gigabytes. A common mistake here is to think that out-of-memory errors mean "no more memory available", when they often mean instead that the OS could not find a contiguous range of unused pages matching the requested size. This is why people often get confused when a request for a one-gigabyte block fails even though they have, say, 30 gigabytes of physical memory free. Once you start allocating blocks that span more than, say, 1% of the typical amount of memory available, it's often time to consider avoiding one giant array to represent the whole thing.
So perhaps the first thing you need to do is rethink the data structure. Instead of allocating a single array of billions of elements, often you'll significantly reduce the odds of running into problems by allocating in smaller chunks (smaller arrays aggregated together). For example, if your access pattern is solely sequential in nature, you can use an unrolled list (arrays linked together). If random access is needed, you might use something like an array of pointers to arrays which each span 4 kilobytes. This requires a bit more work to index an element, but with this kind of scale of billions of elements, it's often a necessity.
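To make the "array of pointers to arrays" idea concrete, here is a minimal sketch. The names (ChunkedBytes, chunked_init, chunked_at, chunked_free) and the one-byte element type are my own assumptions for illustration, and the 4 KB block size is just the figure mentioned above, not a tuned value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sizes: 4 KB blocks, one byte per element. */
enum { BLOCK_BYTES = 4096 };

typedef struct {
    uint8_t **blocks;    /* table of pointers to fixed-size blocks */
    size_t num_blocks;
    size_t size;         /* number of valid elements */
} ChunkedBytes;

void chunked_free(ChunkedBytes *c)
{
    if (c->blocks) {
        for (size_t i = 0; i < c->num_blocks; ++i)
            free(c->blocks[i]);          /* free(NULL) is a no-op */
        free(c->blocks);
    }
    c->blocks = NULL;
    c->num_blocks = 0;
    c->size = 0;
}

/* Allocates many small zeroed blocks instead of one giant array.
   Returns 0 on success, -1 on allocation failure. */
int chunked_init(ChunkedBytes *c, size_t size)
{
    c->size = size;
    c->num_blocks = (size + BLOCK_BYTES - 1) / BLOCK_BYTES;
    c->blocks = calloc(c->num_blocks ? c->num_blocks : 1, sizeof *c->blocks);
    if (!c->blocks)
        return -1;
    for (size_t i = 0; i < c->num_blocks; ++i) {
        c->blocks[i] = calloc(BLOCK_BYTES, 1);
        if (!c->blocks[i]) {
            chunked_free(c);
            return -1;
        }
    }
    return 0;
}

/* Random access costs one extra indirection: pick the block, then the offset. */
uint8_t *chunked_at(ChunkedBytes *c, size_t index)
{
    assert(index < c->size);
    return &c->blocks[index / BLOCK_BYTES][index % BLOCK_BYTES];
}
```

Since the block size is a power of two, the division and modulo in chunked_at normally compile down to a shift and a mask, so the extra indexing work is small compared to a cache miss.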
One thing left unspecified in the question is the memory access pattern. This part is critical for guiding your decisions.
For example, is the data structure solely traversed sequentially, or is random access needed? Are all of these fields (a, b, c, d) needed together all the time, or can they be accessed one or two or three at a time?
Let's try to cover all the possibilities. At the scale we're talking about, this:
struct Foo {
int a1;
int b1;
int c1;
int d1;
};
... is unlikely to be helpful. At this kind of input scale, and accessed in tight loops, your times are generally going to be dominated by the upper levels of the memory hierarchy (paging and CPU cache). It is no longer as critical to focus on the lowest level of the hierarchy (registers and associated instructions). To put it another way, with billions of elements to process, the last thing you should be worrying about is the cost of moving this memory from L1 cache lines to registers and the cost of bitwise instructions (not saying it's not a concern at all, just that it's a much lower priority).
At a small enough scale, where the entirety of the hot data fits into the CPU cache and random access is needed, this kind of straightforward representation can show a performance improvement thanks to gains at the lowest level of the hierarchy (registers and instructions), yet it would require a drastically smaller input than what we're talking about.
So even this is likely to be a considerable improvement:
struct Foo {
char a1;
char b1;
char c1;
char d1;
};
... and this even more:
// Each field packs 4 values with 2-bits each.
struct Foo {
char a4;
char b4;
char c4;
char d4;
};
* Note that you could use bitfields for the above, but bitfields tend to have caveats associated with them depending on the compiler being used. I've often been careful to avoid them due to the portability issues commonly described, though this may be unnecessary in your case. However, as we adventure into SoA and hot/cold field-splitting territories below, we'll reach a point where bitfields can't be used anyway.
This code also places a focus on horizontal logic which can start to make it easier to explore some further optimization paths (ex: transforming the code to use SIMD), as it's already in a miniature SoA form.
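For reference, reading and writing the packed representation without bitfields could look roughly like the sketch below. The names Foo4, get2, set2, foo_get_a and foo_set_a are made up for illustration, and I've swapped char for uint8_t to avoid sign surprises when shifting; logical element i is assumed to live in foos[i / 4], slot i % 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each byte packs 4 values of 2 bits each. */
typedef struct {
    uint8_t a4, b4, c4, d4;
} Foo4;

/* Read value number `slot` (0..3) from a packed byte. */
static inline unsigned get2(uint8_t packed, unsigned slot)
{
    assert(slot < 4);
    return (packed >> (slot * 2)) & 3u;
}

/* Return `packed` with value number `slot` replaced by `value` (kept to 2 bits). */
static inline uint8_t set2(uint8_t packed, unsigned slot, unsigned value)
{
    assert(slot < 4);
    unsigned shift = slot * 2;
    return (uint8_t)((packed & ~(3u << shift)) | ((value & 3u) << shift));
}

/* Accessors for the a field; b, c and d would follow the same pattern. */
static inline unsigned foo_get_a(const Foo4 *foos, size_t i)
{
    return get2(foos[i / 4].a4, (unsigned)(i % 4));
}

static inline void foo_set_a(Foo4 *foos, size_t i, unsigned value)
{
    Foo4 *f = &foos[i / 4];
    f->a4 = set2(f->a4, (unsigned)(i % 4), value);
}
```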
Especially at this kind of scale, and even more so when your memory access is sequential in nature, it helps to think in terms of data "consumption" (how quickly the machine can load data, do the necessary arithmetic, and store the results). A simple mental image I find useful is to imagine the computer as having a "big mouth". It goes faster if we feed it large enough spoonfuls of data at once, not little teeny teaspoons, and with more relevant data packed tightly into a contiguous spoonful.
The above code so far is making the assumption that all of these fields are equally hot (accessed frequently), and accessed together. You may have some cold fields or fields that are only accessed in critical code paths in pairs. Let's say that you rarely access c and d, or that your code has one critical loop that accesses a and b, and another that accesses c and d. In that case, it can be helpful to split it off into two structures:
struct Foo1 {
char a4;
char b4;
};
struct Foo2 {
char c4;
char d4;
};
Again, if we're "feeding" the computer data, and our code is only interested in a and b fields at the moment, we can pack more into spoonfuls of a and b fields if we have contiguous blocks that only contain a and b fields, and not c and d fields. In such a case, c and d fields would be data the computer can't digest at the moment, yet it would be mixed into the memory regions in between a and b fields. If we want the computer to consume data as quickly as possible, we should only be feeding it the relevant data of interest at the moment, so it's worth splitting the structures in these scenarios.
Moving towards vectorization, and assuming sequential access, the fastest rate at which the computer can consume data will often be in parallel using SIMD. In such a case, we might end up with a representation like this:
struct Foo1 {
char* a4n;
char* b4n;
};
... with careful attention to alignment and padding (the size/alignment should be a multiple of 16 bytes for SSE, 32 bytes for AVX, or 64 bytes for AVX-512), which is necessary to use the faster aligned moves into XMM/YMM registers (or ZMM registers with AVX-512).
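One way to get that alignment and padding, sketched with C11's aligned_alloc. This assumes a C11 library (MSVC, for instance, does not provide aligned_alloc and has its own aligned allocation functions), and FooSoA, foo_soa_init, foo_soa_free and the choice of 32 bytes are my assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { SIMD_ALIGN = 32 };   /* assumption: 32 bytes, enough for AVX loads */

typedef struct {
    uint8_t *a4n;        /* packed a values, 4 per byte */
    uint8_t *b4n;        /* packed b values, 4 per byte */
    size_t num_bytes;    /* per array, padded to a multiple of SIMD_ALIGN */
} FooSoA;

/* Returns 0 on success, -1 if either allocation fails. */
int foo_soa_init(FooSoA *s, size_t num_elements)
{
    assert(s != NULL);
    size_t bytes = (num_elements + 3) / 4;                   /* 4 values per byte */
    bytes = (bytes + SIMD_ALIGN - 1) / SIMD_ALIGN * SIMD_ALIGN;
    if (bytes == 0)
        bytes = SIMD_ALIGN;

    /* C11 aligned_alloc wants the size to be a multiple of the alignment. */
    s->a4n = aligned_alloc(SIMD_ALIGN, bytes);
    s->b4n = aligned_alloc(SIMD_ALIGN, bytes);
    if (!s->a4n || !s->b4n) {
        free(s->a4n);
        free(s->b4n);
        s->a4n = s->b4n = NULL;
        s->num_bytes = 0;
        return -1;
    }
    memset(s->a4n, 0, bytes);    /* zero the data and the padding */
    memset(s->b4n, 0, bytes);
    s->num_bytes = bytes;
    return 0;
}

void foo_soa_free(FooSoA *s)
{
    free(s->a4n);
    free(s->b4n);
    s->a4n = s->b4n = NULL;
    s->num_bytes = 0;
}
```

Padding the arrays up to a whole number of vectors means a SIMD loop can process the tail with full-width loads instead of needing a scalar cleanup loop.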
Unfortunately the above representation can start to lose a lot of the potential benefits if a and b are accessed frequently together, especially with a random access pattern. In such a case, a more optimal representation can start looking like this:
struct Foo1 {
char a4x32[32];
char b4x32[32];
};
... where we're now aggregating this structure. This makes it so the a and b fields are no longer so spread apart, allowing a group of 32 packed a bytes and 32 packed b bytes (128 values of each) to fit into a single 64-byte cache line and be accessed together quickly. A single XMM/YMM load also brings in 16 or 32 of these packed bytes, i.e. 64 or 128 a or b elements.
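Indexing into this blocked layout takes three steps: pick the block, pick the byte inside the block, pick the 2-bit slot inside the byte. A minimal scalar sketch follows, with made-up names (FooBlock, aosoa_get_a, and so on) and uint8_t in place of char; a real implementation would likely also align each block to 64 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BYTES_PER_BLOCK = 32, VALUES_PER_BLOCK = BYTES_PER_BLOCK * 4 };

typedef struct {
    uint8_t a4x32[BYTES_PER_BLOCK];   /* 128 packed a values */
    uint8_t b4x32[BYTES_PER_BLOCK];   /* 128 packed b values */
} FooBlock;

/* `within` is the position of the value inside its block (0..127). */
static unsigned block_get(const uint8_t *packed, size_t within)
{
    assert(within < VALUES_PER_BLOCK);
    return (packed[within / 4] >> ((within % 4) * 2)) & 3u;
}

static void block_set(uint8_t *packed, size_t within, unsigned value)
{
    assert(within < VALUES_PER_BLOCK);
    unsigned shift = (unsigned)(within % 4) * 2;
    uint8_t *byte = &packed[within / 4];
    *byte = (uint8_t)((*byte & ~(3u << shift)) | ((value & 3u) << shift));
}

unsigned aosoa_get_a(const FooBlock *blocks, size_t i)
{
    return block_get(blocks[i / VALUES_PER_BLOCK].a4x32, i % VALUES_PER_BLOCK);
}

unsigned aosoa_get_b(const FooBlock *blocks, size_t i)
{
    return block_get(blocks[i / VALUES_PER_BLOCK].b4x32, i % VALUES_PER_BLOCK);
}

void aosoa_set_a(FooBlock *blocks, size_t i, unsigned value)
{
    block_set(blocks[i / VALUES_PER_BLOCK].a4x32, i % VALUES_PER_BLOCK, value);
}

void aosoa_set_b(FooBlock *blocks, size_t i, unsigned value)
{
    block_set(blocks[i / VALUES_PER_BLOCK].b4x32, i % VALUES_PER_BLOCK, value);
}
```

Reading a and b for the same index touches the same FooBlock, which is the point of the layout: both values sit within 64 bytes of each other.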
Normally I try to avoid general wisdom advice in performance questions, but I noticed this one seems to avoid the details that someone who has a profiler in hand would typically mention. So I apologize if this comes off as a bit patronizing or if a profiler is already being actively used, but I think the question warrants this section.
As an anecdote, I've often done a better job (I shouldn't!) at optimizing production code written by people who have far superior knowledge than me about computer architecture (I worked with a lot of people who came from the punch card era and can understand assembly code at a glance), and would often get called in to optimize their code (which felt really odd). It's for one simple reason: I "cheated" and used a profiler (VTune). My peers often didn't (they had an allergy to it and thought they understood hotspots just as well as a profiler and saw profiling as a waste of time).
Of course the ideal is to find someone who has both the computer architecture expertise and a profiler in hand, but lacking one or the other, the profiler can give the bigger edge. Optimization still rewards a productivity mindset which hinges on the most effective prioritization, and the most effective prioritization is to optimize the parts that truly matter the most. The profiler gives us detailed breakdowns of exactly how much time is spent and where, along with useful metrics like cache misses and branch mispredictions, which even the most advanced humans typically can't predict anywhere near as accurately as a profiler can reveal. Furthermore, profiling is often the key to discovering how the computer architecture works at a more rapid pace, by chasing down hotspots and researching why they exist. For me, profiling was the ultimate entry point into better understanding how the computer architecture actually works and not how I imagined it to work. It was only then that the writings of someone as experienced in this regard as Mysticial started to make more and more sense.
One of the things that might start to become apparent here is that there are many optimization possibilities. The answers to this kind of question are going to be about strategies rather than absolute approaches. A lot still has to be discovered in hindsight after you try something, and you keep iterating towards more and more optimal solutions as you need them.
One of the difficulties here in a complex codebase is leaving enough breathing room in the interfaces to experiment and try different optimization techniques, to iterate and iterate towards faster solutions. If the interface leaves room to seek these kinds of optimizations, then we can optimize all day long and often get some marvelous results if we're measuring things properly even with a trial and error mindset.
Leaving enough breathing room in an implementation to experiment and explore faster techniques often requires the interface designs to accept data in bulk. This is especially true if the interfaces involve indirect function calls (ex: through a dylib or a function pointer) where inlining is no longer an effective possibility. In such scenarios, leaving room to optimize without cascading interface breakages often means designing away from the mindset of receiving simple scalar parameters in favor of passing pointers to whole chunks of data (possibly with a stride if there are various interleaving possibilities). So while this is straying into pretty broad territory, a lot of the top priorities in optimizing here are going to boil down to leaving enough breathing room to optimize implementations without cascading changes throughout your codebase, and having a profiler in hand to guide you the right way.
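To give a feel for what "accept data in bulk, possibly with a stride" means, here is a hypothetical interface. The function name, the histogram workload, and the parameter layout are all assumptions for the sake of the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bulk interface: one call processes `count` packed bytes.
   `stride` is the distance in bytes between consecutive inputs, so the same
   function works on a tightly packed array (stride 1) or on a field that is
   interleaved inside larger structs (stride sizeof(that struct)).
   histogram[v] is incremented once for every 2-bit value equal to v. */
void count_values_bulk(const uint8_t *packed, size_t count, size_t stride,
                       unsigned long histogram[4])
{
    assert(stride > 0);
    for (size_t i = 0; i < count; ++i) {
        uint8_t p = packed[i * stride];
        for (unsigned slot = 0; slot < 4; ++slot)
            ++histogram[(p >> (slot * 2)) & 3u];
    }
}
```

Because the caller hands over a whole chunk, the implementation behind this signature can later be changed (unrolled, vectorized, multithreaded) without touching any call site, which is much harder with a function that takes one scalar at a time.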
Anyway, some of these strategies should help guide you the right way. There are no absolutes here, only guides and things to try out, and always best done with a profiler in hand. Yet when processing data of this enormous scale, it's always worth remembering the image of the hungry monster, and how to most effectively feed it these appropriately-sized and packed spoonfuls of relevant data.
Pack them only if space is a consideration - for example, an array of 1,000,000 structs. Otherwise, the code needed to do shifting and masking is greater than the savings in space for the data. Hence you are more likely to have a cache miss on the I-cache than the D-cache.
If what you're after is efficiency of space, then you should consider avoiding structs altogether. The compiler will insert padding into your struct representation as necessary to make its size a multiple of its alignment requirement, which might be as much as 16 bytes (but is more likely to be 4 or 8 bytes, and could after all be as little as 1 byte).
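If you want to see what your own implementation does, you can simply ask it. In the sketch below, struct Mixed and struct Bytes are hypothetical types used only to show padding; the numbers printed depend on your compiler and target, so I'm not listing any expected output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct Mixed {       /* a 1-byte member followed by a 4-byte member */
    uint8_t small;
    uint32_t big;
};

struct Bytes {       /* four 1-byte members */
    uint8_t a, b, c, d;
};

/* Prints the sizes and offsets chosen by this particular implementation. */
void print_layout(void)
{
    printf("sizeof(struct Mixed) = %zu, offsetof(big) = %zu\n",
           sizeof(struct Mixed), offsetof(struct Mixed, big));
    printf("sizeof(struct Bytes) = %zu\n", sizeof(struct Bytes));
}
```

Whatever the exact numbers, sizeof(struct Mixed) will be a multiple of the alignment of its strictest member, which is why it usually comes out larger than the sum of the member sizes.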
If you use a struct anyway, then which to use depends on your implementation. If @dbush's bitfield approach yields one-byte structures then it's hard to beat that. If your implementation is going to pad the representation to at least four bytes no matter what, however, then this is probably the one to use:
struct Foo {
char a;
char b;
char c;
char d;
};
Or I guess I would probably use this variant:
struct Foo {
uint8_t a;
uint8_t b;
uint8_t c;
uint8_t d;
};
Since we're supposing that your struct is taking up a minimum of four bytes, there is no point in packing the data into smaller space. That would be counter-productive, in fact, because it would also make the processor do the extra work packing and unpacking the values within.
For handling large amounts of data, making efficient use of the CPU cache provides a far greater win than avoiding a few integer operations. If your data usage pattern is at least somewhat systematic (e.g. if after accessing one element of your erstwhile struct array, you are likely to access a nearby one next) then you are likely to get a boost in both space efficiency and speed by packing the data as tightly as you can. Depending on your C implementation (or if you want to avoid implementation dependency), you might need to achieve that differently -- for instance, via an array of integers. For your particular example of four fields, each requiring two bits, I would consider representing each "struct" as a uint8_t instead, for a total of 1 byte each.
Maybe something like this:
#include <stdint.h>
#define NUMBER_OF_FOOS 1000000000
#define A 0
#define B 2
#define C 4
#define D 6
#define SET_FOO_FIELD(foos, index, field, value) \
((foos)[index] = (((foos)[index] & ~(3 << (field))) | (((value) & 3) << (field))))
#define GET_FOO_FIELD(foos, index, field) (((foos)[index] >> (field)) & 3)
typedef uint8_t foo;
foo all_the_foos[NUMBER_OF_FOOS];
The field name macros and access macros provide a more legible -- and adjustable -- way to access the individual fields than would direct manipulation of the array (but be aware that these particular macros evaluate some of their arguments more than once). Every bit is used, giving you about as good cache usage as it is possible to achieve through choice of data structure alone.
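A short usage sketch of the same macros, repeated here so the snippet compiles on its own. The function count_a_equals_d is a made-up example workload, and it takes the array as a parameter rather than using the big global so that it can be tried on a small buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define A 0
#define B 2
#define C 4
#define D 6
#define SET_FOO_FIELD(foos, index, field, value) \
    ((foos)[index] = (((foos)[index] & ~(3 << (field))) | (((value) & 3) << (field))))
#define GET_FOO_FIELD(foos, index, field) (((foos)[index] >> (field)) & 3)

typedef uint8_t foo;

/* Counts the elements whose A and D fields hold the same value. */
size_t count_a_equals_d(const foo *foos, size_t n)
{
    size_t matches = 0;
    for (size_t i = 0; i < n; ++i)
        if (GET_FOO_FIELD(foos, i, A) == GET_FOO_FIELD(foos, i, D))
            ++matches;
    return matches;
}
```

Because the macros evaluate index more than once, pass a plain variable as done here and avoid expressions with side effects such as i++.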
I did video decompression for a while. The fastest thing to do is something like this:
short ABCD; //use a 16 bit data type for your example
and set up some macros. Maybe:
#define GETA ((ABCD >> 12) & 0x000F)
#define GETB ((ABCD >> 8) & 0x000F)
#define GETC ((ABCD >> 4) & 0x000F)
#define GETD (ABCD & 0x000F) // no need to shift D
In practice you should try to be moving 32 bit longs or 64 bit long long because that's the native MOVE size on most modern processors.
Using a struct will always create the overhead in your compiled code of extra instructions from the base address of your struct to the field. So get away from that if you really want to tighten your loop.
Edit: The above example gives you 4-bit values. If you really just need values of 0..3, then you can do the same thing to pull out your 2-bit numbers, so GETA might look like this:
#define GETA ((ABCD >> 14) & 0x0003)
And if you are really moving billions of things, and I don't doubt it, just fill up a 32-bit variable and shift and mask your way through it.
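Something along these lines is what I mean by shifting and masking through a 32-bit variable. It is only a sketch: the function name and the "sum everything" workload are placeholders, and it assumes 16 two-bit values per word with the first value in the lowest bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums all 2-bit values stored in an array of 32-bit words (16 per word). */
uint64_t sum_2bit_values(const uint32_t *words, size_t num_words)
{
    uint64_t total = 0;
    for (size_t w = 0; w < num_words; ++w) {
        uint32_t word = words[w];        /* one 32-bit load per 16 values */
        for (int k = 0; k < 16; ++k) {
            total += word & 0x3u;        /* lowest 2 bits are the current value */
            word >>= 2;                  /* move the next value into place */
        }
    }
    return total;
}
```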
Hope this helps.
For dense packing that doesn't incur a large overhead of reading, I'd recommend a struct with bitfields. In your example where you have four values ranging from 0 to 3, you'd define the struct as follows:
struct Foo {
unsigned char a:2;
unsigned char b:2;
unsigned char c:2;
unsigned char d:2;
};
This has a size of 1 byte, and the fields can be accessed simply, i.e. foo.a, foo.b, etc.
By making your struct more densely packed, that should help with cache efficiency.
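A small usage sketch, with a made-up helper make_foo. Note that bitfields of type unsigned char are an implementation-defined extension in standard C (commonly supported, e.g. by GCC and Clang), and the 1-byte size is what those compilers typically give rather than something the standard guarantees, so check sizeof(struct Foo) on your target.

```c
#include <assert.h>
#include <stdio.h>

struct Foo {
    unsigned char a:2;
    unsigned char b:2;
    unsigned char c:2;
    unsigned char d:2;
};

/* Builds a Foo; each argument is masked to its low 2 bits. */
struct Foo make_foo(unsigned a, unsigned b, unsigned c, unsigned d)
{
    struct Foo f;
    f.a = a & 3u;
    f.b = b & 3u;
    f.c = c & 3u;
    f.d = d & 3u;
    return f;
}

/* Prints the size this implementation picked for the struct. */
void print_foo_size(void)
{
    printf("sizeof(struct Foo) = %zu\n", sizeof(struct Foo));
}
```

Fields are then read and written like ordinary members, for example f.b = 2 or if (f.a == f.d), and the compiler generates the shifting and masking.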
Edit:
To summarize the comments:
There's still bit fiddling happening with a bitfield; however, it's done by the compiler and will most likely be more efficient than what you would write by hand (not to mention it makes your source code more concise and less prone to introducing bugs). And given the large number of structs you'll be dealing with, the reduction of cache misses gained by using a packed struct such as this will likely make up for the overhead of bit manipulation the struct imposes.