Often I find myself having to represent a structure that consists of very small values. For example, Foo
has 4 values, a, b, c, d
that, range from
Getting back to the question asked, the Foos are:
used in a tight loop;
their values are read a billion times/s, and that is the bottleneck of the program;
the whole program consists of a big array of billions of Foos;
This is a classic case for writing platform-specific, high-performance code. It takes time to design for each target platform, but the benefits outweigh that cost.
Because this is the bottleneck of the entire program, don't look for a general solution. Instead, test and time several approaches against real data, since the best solution will be platform specific.
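To make "tested and timed" concrete, here is a minimal sketch of how two candidate layouts could be compared. Everything in it is an assumption for illustration: I assume each value fits in 2 bits (the actual range isn't stated above), and `FooBitfield`, `FooBytes`, `make_foos`, `sum_fields` and `time_sum_ms` are hypothetical names, not anything from the question. A real benchmark should use the program's real data and access pattern.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMPTION: each field fits in 2 bits. Adjust the widths to the real range.

// Candidate 1: bit-fields, typically 1 byte per Foo (layout is implementation-defined).
struct FooBitfield {
    std::uint8_t a : 2;
    std::uint8_t b : 2;
    std::uint8_t c : 2;
    std::uint8_t d : 2;
};

// Candidate 2: one byte per field, 4 bytes per Foo, no masking or shifting on read.
struct FooBytes {
    std::uint8_t a, b, c, d;
};

// Fill an array with deterministic values in the range 0..3.
template <typename Foo>
std::vector<Foo> make_foos(std::size_t n) {
    std::vector<Foo> foos(n);
    for (std::size_t i = 0; i < n; ++i) {
        foos[i].a = static_cast<std::uint8_t>(i & 3);
        foos[i].b = static_cast<std::uint8_t>((i >> 2) & 3);
        foos[i].c = static_cast<std::uint8_t>((i >> 4) & 3);
        foos[i].d = static_cast<std::uint8_t>((i >> 6) & 3);
    }
    return foos;
}

// Stand-in for the tight loop: read every field of every Foo.
template <typename Foo>
std::uint64_t sum_fields(const std::vector<Foo>& foos) {
    std::uint64_t total = 0;
    for (const Foo& f : foos) {
        total += f.a + f.b + f.c + f.d;
    }
    return total;
}

// Time one pass over the array in milliseconds; the sum is returned through
// `result` so the compiler cannot discard the loop.
template <typename Foo>
double time_sum_ms(const std::vector<Foo>& foos, std::uint64_t& result) {
    const auto start = std::chrono::steady_clock::now();
    result = sum_fields(foos);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

You would call `time_sum_ms` on `make_foos<FooBitfield>(n)` and `make_foos<FooBytes>(n)` with a large `n`, repeat the runs, and compare on each target platform. The packed layout moves less memory, the byte layout does less work per read, and which one wins depends on the hardware.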
Since this is a large array of billions of Foos, the OP should also consider OpenCL or OpenMP to make the most of the resources available on the runtime hardware. How well that works depends somewhat on what you need from the data, but how to exploit the available parallelism is probably the most important aspect of this type of problem.
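As a rough sketch of the OpenMP route, assuming the work is a reduction over the whole array: the layout here (one Foo per byte, four 2-bit fields) and the names `field` and `sum_all_fields` are my own assumptions for illustration, not something given in the question. Build with OpenMP enabled (for example `-fopenmp` on GCC or Clang); without it the pragma is ignored and the loop simply runs serially.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMPTION: one Foo is packed into one byte as four 2-bit fields,
// a in bits 0-1, b in bits 2-3, c in bits 4-5, d in bits 6-7.
inline unsigned field(std::uint8_t packed, unsigned index) {
    return (packed >> (2u * index)) & 0x3u;
}

// Sum every field of every Foo, splitting the array across threads.
std::uint64_t sum_all_fields(const std::vector<std::uint8_t>& foos) {
    std::uint64_t total = 0;
    // Signed loop index: accepted by every OpenMP version.
    const std::int64_t n = static_cast<std::int64_t>(foos.size());

    // Each thread accumulates a private total; OpenMP adds them at the end.
    #pragma omp parallel for reduction(+ : total)
    for (std::int64_t i = 0; i < n; ++i) {
        const std::uint8_t p = foos[static_cast<std::size_t>(i)];
        total += field(p, 0) + field(p, 1) + field(p, 2) + field(p, 3);
    }
    return total;
}
```

Whether this scales depends on the workload. A pure read-and-sum like this tends to be limited by memory bandwidth rather than core count, so it needs to be timed like any other candidate.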
But there is no single right answer to this question, IMO.