Is there hardware support for 128bit integers in modern processors?

孤城傲影 2020-12-11 00:59

Do we still need to emulate 128bit integers in software, or is there hardware support for them in your average desktop processor these days?

3 answers
  • 2020-12-11 01:26

    The x86-64 instruction set can do 64-bit * 64-bit to 128-bit multiplication in one instruction (mul for unsigned, imul for signed, each in its one-operand form), so I would argue that, to some degree, the x86-64 instruction set does include some support for 128-bit integers.

    If your instruction set does not have an instruction to do 64-bit*64-bit to 128-bit then you need several instructions to emulate this.
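    As a rough sketch of what such an emulation involves, the C function below builds the 128-bit product from four 32-bit * 32-bit partial products. The type u128_parts and the function mul_64x64_to_128 are hypothetical names chosen for illustration; only 64-bit arithmetic that wraps modulo 2^64 is assumed.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value split into two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128_parts;

/* Emulate an unsigned 64x64 => 128-bit multiply using only 64-bit operations. */
u128_parts mul_64x64_to_128(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    /* Four 32x32 => 64-bit partial products. */
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* Sum of everything that lands in bits 32..63; it is at most
       3 * (2^32 - 1), so it cannot overflow 64 bits. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    u128_parts r;
    r.lo = (mid << 32) | (uint32_t)p0;
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}
```

    That is four multiplies plus a handful of shifts and adds, compared with a single mul on x86-64.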

    Because x86-64 has this widening multiply, 128-bit * 128-bit to lower 128-bit multiplication can be done with few instructions. For example, with GCC

    __int128 mul(__int128 a, __int128 b) {
        return a*b;
    }
    

    produces this assembly

    imulq   %rdx, %rsi
    movq    %rdi, %rax
    imulq   %rdi, %rcx
    mulq    %rdx
    addq    %rsi, %rcx
    addq    %rcx, %rdx
    

    which uses one 64-bit * 64-bit to 128-bit instruction, two 64-bit * 64-bit to lower 64-bit instructions, and two 64-bit additions.
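    The same decomposition can be written out by hand. The sketch below is only an illustration, not what GCC does internally: u128_parts and mul_128_low are hypothetical names, and it assumes a compiler with unsigned __int128 (GCC or Clang on a 64-bit target) for the single widening multiply. The low 128 bits are identical for signed and unsigned operands, so unsigned types are used throughout.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value split into two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128_parts;

/* Low 128 bits of a 128x128-bit product. */
u128_parts mul_128_low(u128_parts a, u128_parts b) {
    /* One widening multiply (the mulq in the listing above). */
    unsigned __int128 full = (unsigned __int128)a.lo * b.lo;

    /* Two truncating multiplies and one add (the two imulq and first addq).
       a.hi * b.hi would only affect bits 128 and up, so it is dropped. */
    uint64_t cross = a.lo * b.hi + a.hi * b.lo;

    u128_parts r;
    r.lo = (uint64_t)full;
    r.hi = (uint64_t)(full >> 64) + cross;  /* the second addq */
    return r;
}
```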

  • 2020-12-11 01:29

    I'm going to explain it by comparing desktop processors to simple microcontrollers, because their arithmetic logic units (ALUs), the calculators in the CPU, operate similarly, and by comparing the Microsoft x64 Calling Convention to the System-V Calling Convention. For the short answer scroll to the end; the long answer is that it's easiest to see the difference by comparing x86/x64 to ARM and AVR:

    Long Answer

    Native Double Word Integer Math Architecture Support Comparison

    |        CPU        | word x word => dword | dword x dword => dword |
    |:-----------------:|:--------------------:|:----------------------:|
    |        M0         |           No         |           No           |
    |        AVR        |           No         |           No           |
    |      M3/M4/A      |           Yes        |           No           |
    |      x86/x64      |           Yes        |           No           |
    | SSE/SSE2/AVX/AVX2 |           Yes        |           Yes          |
    

    If you understand this chart, skip to Short Answer

    CPUs in smartphones, PCs, and servers have multiple ALUs that perform calculations on registers of various widths. Microcontrollers, on the other hand, usually have only one ALU. The word size of the CPU is not necessarily the same as the word size of the ALU, though they may be the same, the Cortex-M0 being a prime example.

    ARM Architecture

    The Cortex-M0 is a Von Neumann architecture processor that implements the Thumb instruction set with only a small subset of Thumb-2, which means its instructions are mostly 16-bit Thumb encodings but it has a 32-bit ALU. In the assembly, you'll see mostly 16-bit instructions, and when you have a 32-bit operation you'll load the word into a 32-bit register and use two 16-bit instructions. This is in stark contrast to the Cortex-M3/M4, both fully featured 32-bit Harvard architecture processors. Despite these differences, all ARM CPUs share the same set of architectural registers, which makes it easy to upgrade from the M0 to the M3/M4 and to the faster Cortex-A series smartphone processors with NEON SIMD extensions.

    ARM Architectural Registers

    When performing a binary operation, it is common for the value to overflow a register (i.e. get too large to fit in the register). ALUs have n-bits input and n-bits output with a carryout (i.e. overflow) flag.
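    As a minimal sketch of how the carry-out gets used, the C function below adds two double-word values one word at a time. The names u64_words and add_double_word are hypothetical and 32-bit words are assumed; the unsigned comparison stands in for the hardware carry flag, which C cannot read directly.

```c
#include <stdint.h>

/* Hypothetical helper type: a 64-bit value held in two 32-bit words. */
typedef struct { uint32_t lo, hi; } u64_words;

/* Double-word addition built from word-sized additions plus a carry. */
u64_words add_double_word(u64_words a, u64_words b) {
    u64_words r;
    r.lo = a.lo + b.lo;               /* wraps modulo 2^32 on overflow */
    uint32_t carry = (r.lo < a.lo);   /* 1 if the low-word add wrapped */
    r.hi = a.hi + b.hi + carry;       /* add-with-carry on the high words */
    return r;
}
```

    On real hardware the compare disappears: the low-word add sets the carry flag and an add-with-carry instruction consumes it.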

    Double-word addition cannot be performed in one instruction, but it requires relatively few instructions. Multiplication, however, needs double the word size to fit the result, and the ALU has only n-bit inputs and n-bit outputs when you need a 2n-bit output, so that won't work directly. For example, on a 32-bit CPU, multiplying two 32-bit integers needs a 64-bit result (2 word-sized registers), and multiplying two 64-bit integers needs up to a 128-bit result (4 word-sized registers); 2 is not bad, but 4 gets complicated and you run out of registers. How the CPU handles this differs by architecture. The Cortex-M0 has no instruction for it because its Thumb subset only provides a 32x32=>32-bit multiply, but the Cortex-M3/M4 has an instruction for a 32x32=>64-bit register multiply that takes 3 clock cycles.

    AVR Architecture

    The AVR microcontroller has 131 instructions that work on 32 8-bit registers and is classified as an 8-bit processor by instruction count, but it has both an 8-bit and a 16-bit ALU. The AVR processor cannot do 16x16=>32-bit calculations with two 16-bit register pairs, or 64-bit integer math, without a software hack. This is the opposite of the x86/x64 design in both the organization of registers and ALU overflow operation. This is why AVR is classified as an 8/16-bit CPU. Why do you care? It affects performance and interrupt behavior.

    AVR Architectural Registers

    x86 Architecture

    On x86, multiplying two 32-bit integers to create a 64-bit integer can be done with the MUL instruction, which leaves an unsigned 64-bit result in the EDX:EAX pair; on x64, the same instruction leaves a 128-bit result in the RDX:RAX pair. Multiplying two 64-bit integers on x86, or two 128-bit integers on x64, is a different story. Adding 64-bit integers on x86 requires few instructions because only the carry flag has to pass from the low word to the high word, but 64-bit multiplication requires A LOT of instructions. Here is an example of 32x64=>64-bit signed multiply assembly for x86:

     movl 16(%ebp), %esi    # get y_l
     movl 12(%ebp), %eax    # get x_l
     movl %eax, %edx
     sarl $31, %edx         # get x_h: arithmetic x >> 31, the high 32 bits of the sign-extension of x
     movl 20(%ebp), %ecx    # get y_h
     imull %eax, %ecx       # compute s: x_l*y_h
     movl %edx, %ebx
     imull %esi, %ebx       # compute t: x_h*y_l
     addl %ebx, %ecx        # compute s + t
     mull %esi              # compute u: x_l*y_l (64-bit result in EDX:EAX)
     leal (%ecx,%edx), %edx # u_h += (s + t), result is u
     movl 8(%ebp), %ecx     # get destination pointer
     movl %eax, (%ecx)      # store low 32 bits of the result
     movl %edx, 4(%ecx)     # store high 32 bits of the result
    

    x86 supports pairing up two registers to store the full multiply result (including the high half), but you can't use the two registers to perform the task of a 64-bit ALU. This is the primary reason why x64 software runs faster than x86 software on 64-bit data: you can do the work in a single instruction! You could imagine that 128-bit multiplication in x86 mode would be very computationally expensive, and it is. The x64 is very similar to x86 except with twice the number of bits.
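    To make the intermediate values s, t, and u from the assembly comments above easier to follow, here is a hedged C rendering of the same arithmetic using only 32-bit words plus one widening multiply. The function name mul_32x64_manual is hypothetical, and the final conversion back to a signed value assumes the usual two's-complement behaviour of GCC, Clang, and MSVC.

```c
#include <stdint.h>

/* Signed 32x64 => 64-bit multiply, spelled out in 32-bit pieces. */
int64_t mul_32x64_manual(int32_t x, int64_t y) {
    uint32_t x_l = (uint32_t)x;
    uint32_t x_h = (x < 0) ? 0xFFFFFFFFu : 0u;  /* sign-extension of x */
    uint32_t y_l = (uint32_t)y;
    uint32_t y_h = (uint32_t)((uint64_t)y >> 32);

    uint32_t s = x_l * y_h;            /* truncating multiply (imull) */
    uint32_t t = x_h * y_l;            /* truncating multiply (imull) */
    uint64_t u = (uint64_t)x_l * y_l;  /* widening multiply (mull)    */

    /* Only the high word of u is affected by s and t. */
    uint32_t hi = (uint32_t)(u >> 32) + s + t;
    return (int64_t)(((uint64_t)hi << 32) | (uint32_t)u);
}
```

    In ordinary C this is just x * y with x converted to a 64-bit type; the point of the long form is that a 32-bit target has to do all of this behind the scenes.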

    x86 Architectural Registers

    x64 Architectural Registers

    When CPUs pair 2 word-sized registers to create a single double-word-sized value, the resulting double-word value on the stack will be aligned to a word boundary in RAM. Beyond the two-register pair, four-word math is a software hack. This means that on x64 two 64-bit registers may be combined to hold a 128-bit register-pair result that gets aligned to a 64-bit word boundary in RAM, but 128x128=>128-bit math is a software hack.

    The x86/x64, however, is a superscalar CPU, and the registers you know of are merely the architectural registers. Behind the scenes, there are a lot more registers that help optimize the CPU pipeline to execute instructions out of order using multiple ALUs. While the x64 may not be a 128-bit CPU, SSE/SSE2 introduced 128-bit vector registers with native packed integer math, AVX introduced 256-bit registers, AVX2 extended them to 256-bit integer math, and AVX-512 introduced 512-bit integer math. When returning vector values from functions, a 128-bit SSE/SSE2 result is returned in XMM0, a 256-bit AVX result in YMM0, and a 512-bit AVX-512 result in ZMM0; these, however, are add-ons to the x86/x64, not the primary architecture, and support is entirely compiler and release platform (such as Python) dependent.

    Short Answer

    The way that C++ applications handle 128-bit integers differs based on the operating system or bare-metal calling convention. Microsoft has its own convention in which, much to my own dismay, a 128-bit result CANNOT be returned from a function as a single value. The Microsoft x64 Calling Convention dictates that when returning a value, you may return one 64-bit integer or two 32-bit integers. For example, you can do word * word = dword, but in Visual C++ you must use _umul128 to get the HighProduct through a pointer argument, regardless of it being in the RDX:RAX pair. I cried, it was sad. :-( The System-V calling convention, however, does allow returning 128-bit return types in RDX:RAX. However, the CPU architectural registers DO NOT fully support 128-bit integer math; that is the job of the SIMD vector processing extensions that started with SSE/SSE2.
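    One way to paper over the difference is a small wrapper that picks the mechanism per compiler. This is a sketch under assumptions: the function name mul_high_low is hypothetical, the MSVC branch relies on the _umul128 intrinsic from <intrin.h>, and the other branch assumes a compiler that provides unsigned __int128 (GCC or Clang on a 64-bit target).

```c
#include <stdint.h>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
#endif

/* Unsigned 64x64 => 128-bit multiply.
   Returns the low 64 bits and stores the high 64 bits through *high. */
uint64_t mul_high_low(uint64_t a, uint64_t b, uint64_t *high) {
#if defined(_MSC_VER) && defined(_M_X64)
    /* Visual C++: no 128-bit integer type, so use the intrinsic. */
    return _umul128(a, b, high);
#else
    /* GCC/Clang: the compiler's 128-bit type maps onto the RDX:RAX pair. */
    unsigned __int128 p = (unsigned __int128)a * b;
    *high = (uint64_t)(p >> 64);
    return (uint64_t)p;
#endif
}
```

    Either branch should boil down to the same single mul instruction; only the way the high half is handed back differs.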

    As for whether you should count on 128-bit integer support: it's extremely rare to come across a user on a 32-bit x86 CPU because those CPUs are too slow, so it is not best practice to design software to run on them; doing so increases development costs and may lead to a degraded user experience. Expect an Athlon 64 or Core 2 Duo to be the minimum spec. You can expect the code to not perform as well on Microsoft operating systems as on Unix ones.

    The Intel architectural registers are set in stone, and Intel and AMD are constantly rolling out new architecture extensions, but compilers and apps take a long time to update, so you can't count on the extensions for cross-platform code. You'll want to read the Intel 64 and IA-32 Architectures Software Developer's Manual and the AMD64 Architecture Programmer's Manual.

  • 2020-12-11 01:37

    Short answer is: NO!

    To elaborate, the SSE registers are 128 bits wide, but no instructions exist to treat them as 128-bit integers. At best, these registers are treated as two 64-bit (un)signed integers. Operations like addition can be constructed by adding these two 64-bit values in parallel and manually handling overflow, but not with a single instruction. Implementing this can get quite complicated and "ugly"; look here:

    How can I add together two SSE registers

    This would have to be done for every basic operation, with probably questionable advantages compared to an implementation with 64-bit general-purpose registers ("emulation" in software). On the other hand, an advantage of this SSE approach would be that once it is implemented, it will also work for 256-bit integers (AVX2) and 512-bit integers (AVX-512) with only minor modifications.
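    For a feel of what the lane-wise add plus manual carry handling looks like, here is one possible sketch using SSE2 intrinsics. It is not the approach from the linked question: the function names are hypothetical, an x86-64 target is assumed (for _mm_cvtsi128_si64), and the carry is computed in a general-purpose register because SSE2 has no unsigned 64-bit compare.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Treat an XMM register as one 128-bit unsigned integer
   (low lane = low 64 bits) and add two of them. */
__m128i add_u128_sse2(__m128i a, __m128i b) {
    /* Two independent 64-bit adds; no carry crosses between lanes. */
    __m128i sum = _mm_add_epi64(a, b);

    /* Detect a wrap of the low lane by hand, in scalar code. */
    uint64_t a_lo   = (uint64_t)_mm_cvtsi128_si64(a);
    uint64_t sum_lo = (uint64_t)_mm_cvtsi128_si64(sum);
    long long carry = (sum_lo < a_lo);

    /* Feed the carry into the high lane with a second vector add. */
    return _mm_add_epi64(sum, _mm_set_epi64x(carry, 0));
}

/* Helpers for reading the two halves back out. */
uint64_t low64(__m128i v)  { return (uint64_t)_mm_cvtsi128_si64(v); }
uint64_t high64(__m128i v) {
    return (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));
}
```

    The round trip through scalar registers for the carry is a good part of why the advantage over plain add/adc on general-purpose registers is doubtful.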
