I was trying to measure the difference of using a for
and a foreach
when accessing lists of value types and reference types.
I used the fol
We have the observation that 99.9% of all the floating point operations will involve NaN's, which is at least highly unusual (found by Peter Cordes first). We have another experiment by usr, which found that removing the division instructions makes the time difference almost completely go away.
The fact however is that the NaN's are only generated because the very first division calculates 0.0 / 0.0 which gives the initial NaN. If the divisions are not performed, result will always be 0.0, and we will always calculate 0.0 * temp -> 0.0, 0.0 + temp -> temp, temp - temp = 0.0. So removing the division did not only remove the divisions, but also removed the NaNs. I would expect that the NaN's are actually the problem, and that one implementation handles NaN's very slowly, while the other one doesn't have the problem.
It would be worthwhile starting the loop at i = 1 and measuring again. The four operations result * temp, + temp, / temp, - temp effectively add (1 - temp) so we wouldn't have any unusual numbers (0, infinity, NaN) for most of the operations.
The only problem could be that the division always gives an integer result, and some division implementations have shortcuts when the correct result doesn't use many bits. For example, dividing 310.0 / 31.0 gives 10.0 as the first four bits with a remainder of 0.0, and some implementations can stop evaluating the remaining 50 or so bits while others can't. If there is a significiant difference, then starting the loop with result = 1.0 / 3.0 would make a difference.
valueList[i] = i
, starting from i=0
, so the first loop iteration does 0.0 / 0.0
. So every operation in your entire benchmark is done with NaN
s.
As @usr showed in disassembly output, the 32bit version used x87 floating point, while 64bit used SSE floating point.
I'm not an expert on performance with NaN
s, or the difference between x87 and SSE for this, but I think this explains the 26x perf difference. I bet your results will be a lot closer between 32 and 64bit if you initialize valueList[i] = i+1
. (update: usr confirmed that this made 32 and 64bit performance fairly close.)
Division is very slow compared to other operations. See my comments on @usr's answer. Also see http://agner.org/optimize/ for tons of great stuff about hardware, and optimizing asm and C/C++, some of it relevant to C#. He has instruction tables of latency and throughput for most instructions for all recent x86 CPUs.
However, 10B x87 fdiv
isn't much slower than SSE2's 8B double precision divsd
, for normal values. IDK about perf differences with NaNs, infinities, or denormals.
They have different controls for what happens with NaNs and other FPU exceptions, though. The x87 FPU control word is separate from the SSE rounding / exception control register (MXCSR). If x87 is getting a CPU exception for every division, but SSE isn't, that easily explains the factor of 26. Or maybe there's just a performance difference that big when handling NaNs. The hardware is not optimized for churning through NaN
after NaN
.
IDK if the SSE controls for avoiding slowdowns with denormals will come into play here, since I believe result
will be NaN
all the time. IDK if C# sets the denormals-are-zero flag in the MXCSR, or the flush-to-zero-flag (which writes zeroes in the first place, instead of treating denormals as zero when read back).
I found an Intel article about SSE floating point controls, contrasting it with the x87 FPU control word. It doesn't have much to say about NaN
, though. It ends with this:
Conclusion
To avoid serialization and performance issues due to denormals and underflow numbers, use the SSE and SSE2 instructions to set Flush-to-Zero and Denormals-Are-Zero modes within the hardware to enable highest performance for floating-point applications.
IDK if this helps any with divide-by-zero.
It might be interesting to test a loop body that is throughput-limited, rather than just being one single loop-carried dependency chain. As it is, all of the work depends on previous results; there's nothing for the CPU to do in parallel (other than bounds-check the next array load while the mul/div chain is running).
You might see more difference between the methods if the "real work" occupied more of the CPUs execution resources. Also, on pre-Sandybridge Intel, there's a big difference between a loop fitting in the 28uop loop buffer or not. You get instruction decode bottlenecks if not, esp. when the average instruction length is longer (which happens with SSE). Instructions that decode to more than one uop will also limit decoder throughput, unless they come in a pattern that's nice for the decoders (e.g. 2-1-1). So a loop with more instructions of loop overhead can make the difference between a loop fitting in the 28-entry uop cache or not, which is a big deal on Nehalem, and sometimes helpful on Sandybridge and later.
There may be several reasons why this is executing faster in 64bit on your machine. The reason I asked which CPU you were using was because when 64bit CPUs first made their appearance, AMD and Intel had different mechanisms to handle 64bit code.
Processor architecture:
Intel's CPU architecture was purely 64bit. In order to execute 32bit code, the 32bit instructions needed to be converted (inside the CPU) to 64bit instructions before execution.
AMD's CPU architecture was to build 64bit right on top of their 32bit architecture; that is, it was essentially a 32bit architecture with 64bit extentions - there was no code conversion process.
This was obviously a few years ago now, so I've no idea if/how the technology has changed, but essentially, you would expect 64bit code to perform better on a 64bit machine since the CPU is able to work with double the amount of bits per instruction.
.NET JIT
It's argued that .NET (and other managed languages like Java) are capable of outperforming languages like C++ because of the way the JIT compiler is able to optimize your code according to your processor architecture. In this respect, you might find that the JIT compiler is utilizing something in 64bit architecture that possibly wasn't available or required a workaround when executed in 32bit.
Note:
Rather than using DoubleWrapper, have you considered using Nullable<double>
or shorthand syntax: double?
- I'd be interested to see if that has any impact on your tests.
Note 2: Some people seem to be conflating my comments about 64bit architecture with IA-64. Just to clarify, in my answer, 64bit refers to x86-64 and 32bit refers to x86-32. Nothing here referenced IA-64!
I can reproduce this on 4.5.2. No RyuJIT here. Both x86 and x64 disassemblies look reasonable. Range checks and so on are the same. The same basic structure. No loop unrolling.
x86 uses a different set of float instructions. The performance of these instructions seems to be comparable with the x64 instructions except for the division:
The division operation makes the 32 bit version extremely slow. Uncommenting the division equalizes performance to a large degree (32 bit down from 430ms to 3.25ms).
Peter Cordes points out that the instruction latencies of the two floating point units are not that dissimilar. Maybe some of the intermediate results are denormalized numbers or NaN. These might trigger a slow path in one of the units. Or, maybe the values diverge between the two implementations because of 10 byte vs. 8 byte float precision.
Peter Cordes also points out that all intermediate results are NaN... Removing this problem (valueList.Add(i + 1)
so that no divisor is zero) mostly equalizes the results. Apparently, the 32 bit code does not like NaN operands at all. Let's print some intermediate values: if (i % 1000 == 0) Console.WriteLine(result);
. This confirms that the data is now sane.
When benchmarking you need to benchmark a realistic workload. But who would have thought that an innocent division can mess up your benchmark?!
Try simply summing the numbers to get a better benchmark.
Division and modulo are always very slow. If you modify the BCL Dictionary
code to simply not use the modulo operator to compute the bucket index performance measurable improves. This is how slow division is.
Here's the 32 bit code:
64 bit code (same structure, fast division):
This is not vectorized despite SSE instructions being used.