I'm testing Intel ADX add with carry and add with overflow to pipeline adds on large integers. I'd like to see what expected code generation should look like. From _addcarry_u
This does look like a good test-case. It assembles to correct working code, right? It's useful for a compiler to support the intrinsic in that sense, even if it doesn't yet support making optimal code. It lets people start using the intrinsic. This is necessary for compatibility.
Next year or whenever the compiler's backend support for adcx/adox is done, the same code will compile to faster binaries with no source modification.
I assume that's what's going on for gcc.
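For comparison, a single carry chain is the simplest thing to feed a compiler when checking what it emits today. The following is only a sketch: add_limbs is a hypothetical name, it uses _addcarry_u64 (the plain adc-style sibling of _addcarryx_u64) from <immintrin.h> when building for x86-64 with GCC or clang, and it falls back to ordinary C elsewhere. It assumes unsigned long long is 64 bits wide.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__)
#include <immintrin.h>   /* declares _addcarry_u64 on GCC and clang */
#define USE_ADC_INTRINSIC 1
#else
#define USE_ADC_INTRINSIC 0
#endif

/* Hypothetical helper: dst += src over n limbs, least significant limb first.
   Returns the carry out of the top limb (0 or 1). */
static unsigned char add_limbs(unsigned long long *dst,
                               const unsigned long long *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    unsigned char carry = 0;
    for (size_t i = 0; i < n; i++) {
#if USE_ADC_INTRINSIC
        /* One carry chain: the carry out of limb i feeds limb i + 1. */
        carry = _addcarry_u64(carry, dst[i], src[i], &dst[i]);
#else
        /* Plain C fallback: detect wrap-around by comparing with an operand. */
        unsigned long long partial = dst[i] + src[i];
        unsigned char wrapped = partial < src[i];
        dst[i] = partial + carry;
        carry = (unsigned char)(wrapped | (dst[i] < partial));
#endif
    }
    return carry;
}
```

Compiling this to assembly and reading the loop body shows whether the carry stays in the flag across iterations or is spilled to a register and restored.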
clang 3.8.1's implementation is more literal, but it ends up doing a terrible job: flag-saving with sahf and push/pop of eax. See it on Godbolt.
I think there's even a bug in the asm source output, since mov eax, ch won't assemble. (Unlike gcc, clang/LLVM uses a built-in assembler and doesn't actually go through a text representation of asm on the way from LLVM IR to machine code.) The disassembly of the machine code shows mov eax,ebp there. I think that's also a bug, because bpl (or the rest of the register) doesn't have a useful value at that point. Probably it wanted mov al, ch or movzx eax, ch.
Once GCC is fixed to generate much better inlined code for _addcarryx_..., be careful with your code, because the loop control contains a comparison (which modifies the C and O flags the same way a sub instruction does) and an increment (which modifies the O flag, and also the C flag if the compiler emits it as an add rather than an inc).
for (unsigned int i = 0; i < MAX_ARRAY; i++) {
    c1 = _addcarryx_u64(c1, res[i], a[i], (unsigned long long int*)&res[i]);
    c2 = _addcarryx_u64(c2, res[i], b[i], (unsigned long long int*)&res[i]);
}
For that reason, c1 and c2 in your code will always be handled poorly (saved to and restored from temporary registers on every loop iteration). The code generated by gcc will still look like the assembly you provided, and for good reason.
From a run-time point of view, res[i] is an immediate dependency between the two _addcarryx_u64 calls: the second addition reads the value the first one has just written. The two instructions are therefore not really independent and will not benefit from whatever instruction-level parallelism the processor offers.
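The dependency through res[i] is easy to see in a portable model of the loop above. This is a sketch in plain C rather than the intrinsic itself: add_with_carry and add3_two_chains are hypothetical names, and add_with_carry only mimics the result and carry-out of _addcarryx_u64 for a carry-in of 0 or 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for _addcarryx_u64:
   *out = x + y + carry_in (mod 2^64), return value = carry out (0 or 1). */
static unsigned char add_with_carry(unsigned char carry_in, uint64_t x,
                                    uint64_t y, uint64_t *out)
{
    assert(carry_in <= 1);
    uint64_t partial = x + y;               /* wraps modulo 2^64 */
    unsigned char carry = partial < x;      /* wrapped iff sum < operand */
    uint64_t sum = partial + carry_in;
    carry |= (unsigned char)(sum < partial); /* at most one of the two steps can wrap */
    *out = sum;
    return carry;
}

/* res += a + b over n limbs, least significant limb first, using the same
   two carry chains as the loop above. Returns c1 + c2 (0, 1 or 2). */
static unsigned add3_two_chains(uint64_t *res, const uint64_t *a,
                                const uint64_t *b, size_t n)
{
    unsigned char c1 = 0, c2 = 0;
    for (size_t i = 0; i < n; i++) {
        c1 = add_with_carry(c1, res[i], a[i], &res[i]); /* chain 1 writes res[i] */
        c2 = add_with_carry(c2, res[i], b[i], &res[i]); /* chain 2 reads that value */
    }
    return (unsigned)c1 + c2;
}
```

The carries c1 and c2 are independent of each other, but within one iteration the second call cannot start until the first has produced res[i], which is the serialization described above. A model like this is also handy as a reference result when testing an intrinsic-based version.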
I understand the code is only an example, but it may not be the best example to use once gcc is modified.
The addition of 3 numbers in large integer arithmetic is a tough problem. Vectorization helps, and then you had better use _addcarryx to handle the loop control in parallel (an increment and a comparison + branch on the same variable, which is yet another tough problem).