TL;DR: Why is multiplying/casting data in size_t slow and why does this vary per platform?
I'm having some performance issues that I don'
For your original questions:
For the additional questions:
In general, avoid both explicit and implicit casts wherever they are not really necessary. For example, find out which concrete data type is behind size_t in your environment (GCC) and use that type for the loop variable. In your example, the square of an unsigned integer is always an integer, so there is no reason to use double here. Stick to integer types for maximum performance.
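As a rough illustration of that advice, here is a minimal sketch of a sum-of-squares loop written two ways. The original loop is not shown in this thread, so the function names, the uint32_t element type and the array layout are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: integer-only accumulation.
   No integer-to-floating-point conversion happens inside the loop. */
uint64_t sum_squares_int(const uint32_t *data, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (uint64_t)data[i] * data[i];   /* widen first to avoid 32-bit overflow */
    return sum;
}

/* Hypothetical example: the same loop accumulating in double.
   Every element is converted to floating point before the multiply. */
double sum_squares_double(const uint32_t *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += (double)data[i] * (double)data[i];
    return sum;
}
```

Both functions return the same value for small inputs; which one is faster depends on the compiler and target, so measure on your own platform.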
On x86, the conversion of uint64_t to floating point is slower because there are only instructions to convert int64_t, int32_t and int16_t. int16_t and, in 32-bit mode, int64_t can only be converted using x87 instructions, not SSE.
When converting uint64_t to floating point, GCC 4.2.1 first converts the value as if it were an int64_t and then adds 2^64 if it was negative to compensate. (When using the x87 on Windows and *BSD, or if you changed the precision control, beware that the conversion ignores precision control but the addition respects it.)
A uint32_t is first extended to int64_t.
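The two strategies described above can be sketched in plain C. This only illustrates the arithmetic; it is not the code GCC emits, and the helper names are made up for this example. It also assumes that casting an out-of-range uint64_t to int64_t wraps modulo 2^64, which is implementation-defined in C but is what GCC does.

```c
#include <stdint.h>

/* Hypothetical helper: convert as if signed, then compensate by adding 2^64. */
double u64_to_double_via_signed(uint64_t x)
{
    int64_t s = (int64_t)x;          /* assumed to wrap: values >= 2^63 become negative */
    double d = (double)s;            /* the signed conversion the hardware supports */
    if (s < 0)
        d += 18446744073709551616.0; /* 2^64, undoes the wrap-around */
    return d;
}

/* Hypothetical helper: widen to int64_t first, then use the signed conversion.
   Every uint32_t value fits in int64_t, so no compensation is needed. */
double u32_to_double_via_i64(uint32_t x)
{
    return (double)(int64_t)x;
}
```

Because the 64-bit path rounds twice (once in the conversion, once in the addition), this sketch is not guaranteed to round identically to a direct cast for every input.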
When converting 64-bit integers in 32-bit mode on processors with certain 64-bit capabilities, a store-to-load forwarding issue may cause stalls. The 64-bit integer is written as two 32-bit values and read back as one 64-bit value. This can be very bad if the conversion is part of a long dependency chain (not in this case).