I often see code that converts ints to doubles to ints to doubles and back once again (sometimes for good reasons, sometimes not), and it just occurred to me that this seems
Here's what I could dig up myself, for x86-64 doing FP math with SSE2 (not legacy x87 where changing the rounding mode for C++'s truncation semantics was expensive):
When I take a look at the generated assembly from clang and gcc, it looks like the cast from int to double boils down to one instruction: cvtsi2sd. (cvtsi2sdl is the AT&T syntax for cvtsi2sd with 32-bit operand-size.)
From double to int it's cvttsd2si.
With auto-vectorization, we get cvtdq2pd (packed int to packed double).
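As a minimal sketch of the casts in question (the function names are mine, purely for illustration), these are the two scalar conversions. On x86-64 with SSE2, compilers typically emit one conversion instruction for each, but check your own compiler's output rather than taking the comments as guaranteed.

```c
/* int -> double: typically compiles to cvtsi2sd on x86-64 with SSE2. */
double int_to_double(int i)
{
    return (double)i;
}

/* double -> int: C requires truncation toward zero, which typically
   compiles to cvttsd2si (the extra "t" stands for truncate). */
int double_to_int(double d)
{
    return (int)d;
}
```

Compiling with something like `gcc -O2 -S` and reading the generated assembly is the quickest way to confirm which instructions you actually get.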
So I suppose the question becomes: what is the cost of those?
These instructions each cost approximately the same as an FP addsd plus a movq xmm, r64 (fp <- integer) or movq r64, xmm (integer <- fp), because they decode to 2 uops which run on the same ports, on mainstream (Sandybridge/Haswell/Skylake) Intel CPUs.
The Intel® 64 and IA-32 Architectures Optimization Reference Manual lists the latency of the cvttsd2si instruction as 5 cycles (see Appendix C-16). cvtsi2sd, depending on your architecture, has latency varying from 1 on Silvermont to more like 7-16 on several other architectures.
Agner Fog's instruction tables have more accurate/sensible numbers, like 5-cycle latency for cvtsi2sd on Silvermont (with 1 per 2 clock throughput), or 4c latency on Haswell, with one per clock throughput (if you avoid the dependency on the destination register from merging with the old upper half, like gcc usually does with pxor xmm0,xmm0).
SIMD packed-float to packed-int is great; single uop. But converting to double requires a shuffle to change element size. SIMD float/double<->int64_t doesn't exist until AVX512, but can be done manually with limited range.
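One widely used way to do the limited-range conversion manually is the "magic number" trick, shown here in scalar C for clarity. I'm assuming this is the kind of technique meant above; the function names and the constant's name are mine. It assumes IEEE 754 doubles evaluated without excess precision (true for SSE2 math), only works for magnitudes below about 2^51, and rounds to nearest rather than truncating.

```c
#include <stdint.h>
#include <string.h>

/* 2^52 + 2^51. Adding this to a small-magnitude value forces the sum into
   [2^52, 2^53), where doubles are spaced exactly 1 apart, so the integer
   value ends up in the low mantissa bits. */
static const double MAGIC = 6755399441055744.0;

/* double -> int64_t, valid for roughly -2^51 <= x < 2^51.
   Rounds using the current rounding mode (nearest-even by default). */
int64_t double_to_int64_limited(double x)
{
    double shifted = x + MAGIC;
    int64_t bits, magic_bits;
    memcpy(&bits, &shifted, sizeof bits);
    memcpy(&magic_bits, &MAGIC, sizeof magic_bits);
    return bits - magic_bits;   /* difference of bit patterns is the integer */
}

/* int64_t -> double, valid for -2^51 <= n < 2^51. */
double int64_to_double_limited(int64_t n)
{
    uint64_t bits;
    double shifted;
    memcpy(&bits, &MAGIC, sizeof bits);
    bits += (uint64_t)n;        /* place n in the low mantissa bits */
    memcpy(&shifted, &bits, sizeof shifted);
    return shifted - MAGIC;     /* exact: result is representable */
}
```

The same add and integer-subtract steps map directly onto packed SIMD instructions, which is what makes the trick useful when no native packed 64-bit conversion is available.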
Intel's manual defines latency as: "The number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction." But a more useful definition is the number of clocks from an input being ready until the output becomes ready. Throughput is more important than latency if there's enough parallelism for out-of-order execution to do its job; see "What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?".
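To make the latency/throughput distinction concrete, here is a rough sketch of two loops (both function names are hypothetical). In the first, every conversion depends on the previous iteration, so the conversion latencies add up. In the second, the conversions are independent and only a cheap integer add is serial, so throughput should be the limit. I haven't attached timings because they depend on the CPU and on what the compiler does with the loops.

```c
#include <stddef.h>

/* Latency-bound: each int -> double -> int round trip needs the result of
   the previous iteration, forming one long dependency chain. */
int chained_round_trips(int start, size_t n)
{
    int x = start;
    for (size_t i = 0; i < n; i++) {
        double d = (double)x;   /* int -> double */
        x = (int)(d + 1.0);     /* double -> int, depends on d */
    }
    return x;
}

/* Throughput-bound: each double -> int conversion is independent of the
   others; only the integer accumulation carries across iterations. */
long sum_truncated(const double *values, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (int)values[i];  /* independent conversion */
    }
    return sum;
}
```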
The same Intel manual says that an integer add instruction has 1-cycle latency and an integer imul has 3 (Appendix C-27). FP addsd and mulsd run at 2 per clock throughput, with 4 cycle latency, on Skylake. Same for the SIMD versions, and for FMA, with 128 or 256-bit vectors.
On Haswell, addsd / addpd is only 1 per clock throughput, but 3 cycle latency thanks to a dedicated FP-add unit.
So, the answer boils down to:
1) It's hardware optimized, and the compiler leverages the hardware machinery.
2) It costs only a bit more than a multiply does in terms of the number of cycles in one direction, and a highly variable amount in the other (depending on your architecture). Its cost is neither free nor absurd, but probably warrants more attention given how easy it is to write code that incurs the cost in a non-obvious way.
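As an illustration of the "non-obvious" case (a made-up example, not taken from any particular code base), multiplying an int by a floating-point literal quietly introduces a round trip through double:

```c
/* Hidden round trip: 0.5 is a double, so value is converted
   int -> double, multiplied, then truncated back to int on return. */
int half_via_double(int value)
{
    return value * 0.5;
}

/* Stays in the integer domain: no conversion instructions are needed. */
int half_via_int(int value)
{
    return value / 2;
}
```

Both functions return the same result for every int, since both truncate toward zero, but only the first pays for two conversions.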
Of course this kind of question depends on the exact hardware and even on the mode.
On my x86 i7, in 32-bit mode with default options (gcc -m32 -O3), the conversion from int to double is quite fast; the opposite direction is much slower because the C standard mandates an absurd rule (truncation of the fractional part).
This way of rounding is bad both for math and for hardware and requires the FPU to switch to this special rounding mode, perform the truncation, and switch back to a sane way of rounding.
If you need speed, doing the float->int conversion with the simple fistp instruction is faster and also much better for computation results, but it requires some inline assembly.
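/* GCC-style inline assembly, x86 only. Rounds using the current x87
   rounding mode (round-to-nearest by default) instead of truncating. */
inline int my_int(double x)
{
    int r;
    asm ("fldl %1\n"      /* push x onto the x87 stack */
         "fistpl %0\n"    /* store it as a 32-bit int and pop */
         : "=m"(r)
         : "m"(x));
    return r;
}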
This is more than 6 times faster than the naive x = (int)y; conversion (and doesn't have a bias toward 0).
The very same processor, when used in 64-bit mode, however, has no speed problems, and using the fistp code actually makes the code run somewhat slower.
Apparently the hardware guys gave up and implemented the bad rounding algorithm directly in hardware (so badly rounding code can now run fast).
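For comparison, SSE2 has both a truncating conversion (cvttsd2si) and one that honors the current rounding mode (cvtsd2si), so in 64-bit mode you can get fistp-style rounding without x87 code. A minimal sketch using the intrinsics from <emmintrin.h> (x86 with SSE2 only; the wrapper names are mine):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Truncates toward zero, like a C cast: maps to cvttsd2si. */
int to_int_truncate(double x)
{
    return _mm_cvttsd_si32(_mm_set_sd(x));
}

/* Rounds using the current MXCSR rounding mode, which is
   round-to-nearest-even unless changed: maps to cvtsd2si. */
int to_int_nearest(double x)
{
    return _mm_cvtsd_si32(_mm_set_sd(x));
}
```

I haven't benchmarked these two against each other, so treat this as a way to get the rounding behaviour you want rather than as a speed claim.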