Having already read this question I\'m reasonably certain that a given process using floating point arithmatic with the same input (on the same hardware, compiled with the s
In almost any situation where there's a fast mode and a safe mode, you'll find a trade-off of some sort. Otherwise everything would run in fast-safe mode :-).
And, if you're getting different results with the same input, your process is not deterministic, no matter how much you believe it to be (in spite of the empirical evidence).
I would say your explanation is the most likely. Put it in safe mode and see if the non-determinism goes away. That will tell you for sure.
As to whether there are other optimizations, if you're compiling on the same hardware with the same compiler/linker and the same options to those tools, it should generate identical code. I can't see any other possibility other than the fast mode (or bit rot in the memory due to cosmic rays, but that's pretty unlikely).
Following your update:
Intel has a document here which explains some of the things they're not allowed to do in safe mode, including but not limited to:
(a+b)+c -> a+(b+c)
.x + 0 -> x, x * 0 -> 0
.a/b -> a*(1/b)
.While you state that these operations are compile-time defined, the Intel chips are pretty darned clever. They can re-order instructions to keep pipelines full in multi-CPU set-ups so, unless the code specifically prohibits such behavior, things may change at run-time (not compile-time) to keep things going at full speed.
This is covered (briefly) on page 15 of that linked document that talks about vectorization ("Issue: different results re-running the same binary on the same data on the same processor").
My advice would be to decide whether you need raw grunt or total reproducability of results and then choose the mode based on that.
If your program is parallelized, as it might be to run on a quad core, then it may well be non-deterministic.
Imagine that you have 4 processors adding a floating point value to the same memory location. Then you might get
(((InitialValue+P1fp)+P2fp)+P3fp)+P4fp
or
(((InitialValue+P2fp)+P3fp)+P1fp)+P4fp
or any of the other possible orderings.
Heck, you might even get
InitialValue+(P2fp+P3fp)+(P1fp+P4fp)
if the compiler is good enough.
Unfortunately, floating point addition is not commutative or associative. Real number arithmetic is, but floating point is not, because of rounding, overflow, and underflow.
Because of this, parallel FP computation is often non-deterministic. "Often", because programs that look like
on each processor
while( there is work to do ) {
get work
calculate result
add to total
}
will be non-deterministic, because the amount of time that each takes may vary widely - you can't predict the order of operations. (Worse if the threads interact.)
But not always, because there are styles of parallel programming that are deterministic.
Of course, what many folks who care about determinism do is work in integer or fixed point to avoid the problem. I am particularly fond of superaccumulators, 512, 1024, or 2048 bit numbers that floating point numbers can be added to, without suffering rounding errors.
As for a single threaded application: the compiler may rearrange the code. Different compilations may give different answers. But any particular binary should be deterministic.
Unless... you are working in a dynamic language. That performs optimizatuions that reorder the FP computations, that vary over time.
Or unless... really long shot: Itanium had some features, like the ALAT, that made even single threaded coded non-deterministic. You are unlikely to be affected by these.