Why doesn't my processor have built-in BigInt support?

执笔经年 2021-02-20 13:20

As far as I understood it, BigInts are usually implemented in most programming languages as arrays of digits, where, e.g., when adding two of them, each digit is added on…

8 answers
  •  名媛妹妹
    2021-02-20 13:50

    It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.

    Almost all CPUs do have this built-in. You have to use a software loop around the relevant instructions, but that doesn't make it slower if the loop is efficient. (That's non-trivial on x86, due to partial-flag stalls; see below.)

    e.g. if x86 provided rep adc to do dst += src, taking 2 pointers and a length as input (like rep movsd for memcpy), it would still be implemented as a loop in microcode.

    It would be possible for a 32bit x86 CPU to have an internal implementation of rep adc that used 64bit adds internally, since 32bit CPUs probably still have a 64bit adder. However, 64bit CPUs probably don't have a single-cycle latency 128b adder. So I don't expect that having a special instruction for this would give a speedup over what you can do with software, at least on a 64bit CPU.

    Maybe a special wide-add instruction would be useful on a low-power low-clock-speed CPU where a really wide adder with single-cycle latency is possible.


    The x86 instructions you're looking for are:

    • adc: add with carry / sbb: subtract with borrow
    • mul: full multiply, producing upper and lower halves of the result: e.g. 64b*64b => 128b
    • div: dividend is twice as wide as the other operands, e.g. 128b / 64b => 64b division.

    Of course, adc works on binary integers, not single decimal digits. x86 can adc in 8, 16, 32, or 64bit chunks, unlike RISC CPUs which typically only adc at full register width. (GMP calls each chunk a "limb"). (x86 has some instructions for working with BCD or ASCII, but those instructions were dropped for x86-64.)

    imul / idiv are the signed equivalents. Add works the same for signed 2's complement as for unsigned, so there's no separate instruction; just look at the relevant flags to detect signed vs. unsigned overflow. But for adc, remember that only the most-significant chunk has the sign bit; the rest are essentially unsigned.

    ADX and BMI/BMI2 add some instructions like mulx: full-multiply without touching flags, so it can be interleaved with an adc chain to create more instruction-level parallelism for superscalar CPUs to exploit.

    In x86, adc is even available with a memory destination, so it performs exactly like you describe: one instruction triggers the whole read / modify / write of a chunk of the BigInteger. See example below.


    Most high-level languages (including C/C++) don't expose a "carry" flag

    Usually there are no add-with-carry intrinsics directly available in C. BigInteger libraries usually have to be written in asm for good performance.

    However, Intel actually has defined intrinsics for adc (and adcx / adox).

    unsigned char _addcarry_u64 (unsigned char c_in, unsigned __int64 a, \
                                 unsigned __int64 b, unsigned __int64 * out);
    

    So the carry result is handled as an unsigned char in C. For the _addcarryx_u64 intrinsic, it's up to the compiler to analyse the dependency chains and decide which adds to do with adcx and which to do with adox, and how to string them together to implement the C source.

    IDK what the point of the _addcarryx intrinsics is, instead of just having the compiler use adcx/adox for the existing _addcarry_u64 intrinsic when there are parallel dep chains that can take advantage of it. Maybe some compilers aren't smart enough for that.
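
    For a rough idea of how the intrinsic gets used, here's a minimal, untested C sketch (the name bigint_add_c is mine, not from any library; it assumes <immintrin.h> and an x86-64 compiler that provides _addcarry_u64):

    #include <immintrin.h>   // _addcarry_u64 (GCC / Clang / MSVC targeting x86-64)
    #include <stdint.h>
    #include <stddef.h>
    
    // dst[] += src[] over len 64-bit limbs; the final carry-out is simply dropped.
    void bigint_add_c(uint64_t *dst, const uint64_t *src, size_t len)
    {
        unsigned char carry = 0;                 // no carry-in for the first limb
        for (size_t i = 0; i < len; i++) {
            unsigned long long sum;
            // sum = dst[i] + src[i] + carry; the return value is the carry-out
            carry = _addcarry_u64(carry, dst[i], src[i], &sum);
            dst[i] = sum;
        }
    }

    Whether a given compiler turns that into a clean adc chain or spills the carry into a register varies, so it's worth checking the generated asm.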


    Here's an example of a BigInteger add function, in NASM syntax:

    ;;;;;;;;;;;; UNTESTED ;;;;;;;;;;;;
    ; C prototype:
    ; void bigint_add(uint64_t *dst, uint64_t *src, size_t len);
    ;   len is an element-count, not byte-count
    
    global bigint_add
    bigint_add:   ; AMD64 SysV ABI: dst=rdi, src=rsi, len=rdx
    
                                  ; set up for using dst as an index for src
        sub    rsi, rdi           ;  rsi -= dst.  So orig_src = rsi + rdi
    
        clc                           ;  CF=0 to set up for the first adc
               ; alternative: peel the first iteration and use add instead of adc
    
    .loop:
        mov    rax, [rsi + rdi]   ; load from src
        adc    rax, [rdi]         ;  <================= ADC with dst
        mov    [rdi], rax         ; store back into dst.  This appears to be cheaper than  adc  [rdi], rax  since we're using a non-indexed addressing mode that can micro-fuse
    
        lea    rdi,  [rdi + 8]    ; pointer-increment without clobbering CF
        dec    rdx                ; preserves CF
        jnz    .loop              ; loop while(--len)
    
        ret
    

    On older CPUs, especially pre-Sandybridge, adc will cause a partial-flag stall when reading CF after dec writes other flags. Looping with a different instruction will help for old CPUs which stall while merging partial-flag writes, but not be worth it on SnB-family.

    Loop unrolling is also very important for adc loops. adc decodes to multiple uops on Intel, so loop overhead is a problem, especially if you have extra loop overhead from avoiding partial-flag stalls. If len is a small known constant, a fully-unrolled loop is usually good. (e.g. compilers just use add/adc to handle a uint128_t add on x86-64.)
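
    As a small illustration of that last point (my example, assuming GCC/Clang's unsigned __int128 extension), a plain 128-bit add like the following typically compiles to just an add/adc pair on x86-64:

    // 128-bit add via the unsigned __int128 extension; on x86-64 the compiler
    // typically emits add for the low half and adc for the high half.
    unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
    {
        return a + b;
    }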

    adc with a memory destination appears not to be the most efficient way, since the pointer-difference trick lets us use a single-register addressing mode for dst. (Without that trick, memory-operands wouldn't micro-fuse).

    According to Agner Fog's instruction tables for Haswell and Skylake, adc r,m is 2 uops (fused-domain) with one per 1 clock throughput, while adc m, r/i is 4 uops (fused-domain), with one per 2 clocks throughput. Apparently it doesn't help that Broadwell/Skylake run adc r,r/i as a single-uop instruction (taking advantage of ability to have uops with 3 input dependencies, introduced with Haswell for FMA). I'm also not 100% sure Agner's results are right here, since he didn't realize that SnB-family CPUs only micro-fuse indexed addressing modes in the decoders / uop-cache, not in the out-of-order core.

    Anyway, this simple not-unrolled-at-all loop is 6 uops, and should run at one iteration per 2 cycles on Intel SnB-family CPUs. Even if it takes an extra uop for partial-flag merging, that's still easily less than the 8 fused-domain uops that can be issued in 2 cycles.

    Some minor unrolling could get this close to 1 adc per cycle, since that part is only 4 uops. However, 2 loads and one store per cycle isn't quite sustainable.


    Extended-precision multiply and divide are also possible, taking advantage of the widening / narrowing multiply and divide instructions. It's much more complicated, of course, due to the nature of multiplication.
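
    As a hypothetical sketch (the name bigint_mul_limb is mine, not from any library), the 64x64 => 128-bit widening multiply that mul exposes is reachable from C through unsigned __int128 on GCC/Clang, which is enough to write the basic building block of schoolbook BigInteger multiplication: multiplying a limb array by a single 64-bit limb.

    #include <stdint.h>
    #include <stddef.h>
    
    // dst[] = src[] * multiplier over len 64-bit limbs; returns the final carry-out limb.
    // Each step is a 64x64 => 128-bit widening multiply (mul on x86-64) plus carry propagation.
    uint64_t bigint_mul_limb(uint64_t *dst, const uint64_t *src,
                             size_t len, uint64_t multiplier)
    {
        uint64_t carry = 0;
        for (size_t i = 0; i < len; i++) {
            // product + carry can't overflow 128 bits
            unsigned __int128 prod = (unsigned __int128)src[i] * multiplier + carry;
            dst[i] = (uint64_t)prod;            // low 64 bits
            carry  = (uint64_t)(prod >> 64);    // high 64 bits feed the next limb
        }
        return carry;
    }

    A full multi-limb multiply then accumulates one such row per limb of the multiplier, and that carry-heavy accumulation is where adc / adcx / adox matter again.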


    It's not really helpful to use SSE for add-with-carry, or AFAIK any other BigInteger operations.

    If you're designing a new instruction-set, you can do BigInteger adds in vector registers if you have the right instructions to efficiently generate and propagate carry. There's been some back-and-forth discussion on the costs and benefits of supporting carry flags in hardware vs. having software generate the carry-out like MIPS does: compare against one of the inputs to detect unsigned wraparound, putting the result in another integer register.
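
    That software carry-out trick looks roughly like this in C (my illustration, using 64-bit limbs): on unsigned wraparound the truncated sum is smaller than either input, so a compare recovers the carry into an ordinary integer register.

    #include <stdint.h>
    
    // Carry-out without a flags register: a + b wraps modulo 2^64 exactly when the
    // true sum doesn't fit in 64 bits, and in that case sum < a (and sum < b).
    uint64_t add_with_carry_out(uint64_t a, uint64_t b, uint64_t *carry_out)
    {
        uint64_t sum = a + b;
        *carry_out = (sum < a) ? 1 : 0;
        return sum;
    }

    Propagating a carry-in as well takes an extra add and compare per limb, which is roughly the overhead a hardware carry flag avoids.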
