What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?

后端 未结 3 1647
旧时难觅i
旧时难觅i 2021-01-12 07:07

Say, I want to clear 4 zmm registers.

Will the following code provide the fastest speed?

vpxorq  zmm0, zmm0, zmm0
vpxorq  zmm1, zmm1, zmm1
vpxorq  zm         


        
3条回答
  •  栀梦
    栀梦 (楼主)
    2021-01-12 07:38

    Following Paul R's advice of looking to see what code compilers generate, we see that ICC uses VPXORD to zero-out one ZMM register, then VMOVAPS to copy this zeroed XMM register to any additional registers that need to be zeroed. In other words:

    vpxord    zmm3, zmm3, zmm3
    vmovaps   zmm2, zmm3
    vmovaps   zmm1, zmm3
    vmovaps   zmm0, zmm3
    

    GCC does essentially the same thing, but uses VMOVDQA64 for ZMM-ZMM register moves:

    vpxord      zmm3, zmm3, zmm3
    vmovdqa64   zmm2, zmm3
    vmovdqa64   zmm1, zmm3
    vmovdqa64   zmm0, zmm3
    

    GCC also tries to schedule other instructions in-between the VPXORD and the VMOVDQA64. ICC doesn't exhibit this preference.

    Clang uses VPXORD to zero all of the ZMM registers independently, a la:

    vpxord  zmm0, zmm0, zmm0
    vpxord  zmm1, zmm1, zmm1
    vpxord  zmm2, zmm2, zmm2
    vpxord  zmm3, zmm3, zmm3
    

    The above strategies are followed by all versions of the indicated compilers that support generation of AVX-512 instructions, and don't appear to be affected by requests to tune for a particular microarchitecture.


    This pretty strongly suggests that VPXORD is the instruction you should be using to clear a 512-bit ZMM register.

    Why VPXORD instead of VPXORQ? Well, you only care about the size difference when you're masking, so if you're just zeroing a register, it really doesn't matter. Both are 6-byte instructions, and according to Agner Fog's instruction tables, on Knights Landing:

    • Both execute on the same number of ports (FP0 or FP1),
    • Both decode to 1 µop
    • Both have a minimum latency of 2, and a reciprocal throughput of 0.5.
      (Note that this last bullet highlights a major disadvantage of KNL—all vector instructions have a latency of at least 2 clock cycles, even the simple ones that have 1-cycle latencies on other microarchitectures.)

    There's no clear winner, but compilers seem to prefer VPXORD, so I'd stick with that one, too.

    What about VPXORD/VPXORQ vs. VXORPS/VXORPD? Well, as you mention in the question, packed-integer instructions can generally execute on more ports than their floating-point counterparts, at least on Intel CPUs, making the former preferable. However, that isn't the case on Knights Landing. Whether packed-integer or floating-point, all logical instructions can execute on either FP0 or FP1, and have identical latencies and throughput, so you should theoretically be able to use either. Also, since both forms of instructions execute on the floating-point units, there is no domain-crossing penalty (forwarding delay) for mixing them like you would see on other microarchitectures. My verdict? Stick with the integer form. It isn't a pessimization on KNL, and it's a win when optimizing for other architectures, so be consistent. It's less you have to remember. Optimizing is hard enough as it is.

    Incidentally, the same is true when it comes to deciding between VMOVAPS and VMOVDQA64. They are both 6-byte instructions, they both have the same latency and throughput, they both execute on the same ports, and there are no bypass delays that you have to be concerned with. For all practical purposes, these can be seen as equivalent when targeting Knights Landing.

    And finally, you asked whether "the CPU [is] smart enough not to make false dependencies on the previous values of the ZMM registers when [you] clear them with VPXORD/VPXORQ". Well, I don't know for sure, but I imagine so. XORing a register with itself to clear it has been an established idiom for a long time, and it is known to be recognized by other Intel CPUs, so I can't imagine why it wouldn't be on KNL. But even if it's not, this is still the most optimal way to clear a register.

    The alternative would be something like moving in a 0 value from memory, which is not only a substantially longer instruction to encode but also requires you to pay a memory-access penalty. This isn't going to be a win…unless maybe you were throughput-bound, since VMOVAPS with a memory operand executes on a different unit (a dedicated memory unit, rather than either of the floating-point units). You'd need a pretty compelling benchmark to justify that kind of optimization decision, though. It certainly isn't a "general purpose" strategy.

    Or maybe you could do a subtraction of the register with itself? But I doubt this would be any more likely to be recognized as dependency-free than XOR, and everything else about the execution characteristics will be the same, so that's not a compelling reason to break from the standard idiom.

    In both of these cases, the practicality factor comes into play. When push comes to shove, you have to write code for other humans to read and maintain. Since it's going to cause everyone forever after who reads your code to stumble, you'd better have a really compelling reason for doing something odd.


    Next question: should we repeatedly issue VPXORD instructions, or should we copy one zeroed register into the others?

    Well, VPXORD and VMOVAPS have equivalent latencies and throughputs, decode to the same number of µops, and can execute on the same number of ports. From that perspective, it doesn't matter.

    What about data dependencies? Naïvely, one might assume that repeated XORing is better, since the move depends on the initial XOR. Perhaps this is why Clang prefers repeated XORing, and why GCC prefers to schedule other instructions in-between the XOR and MOV. If I were writing the code quickly, without doing any research, I'd probably write it the way Clang does. But I can't say for sure whether this is the most optimal approach without benchmarks. And with neither of us having access to a Knights Landing processor, these aren't going to be easy to come by. :-)

    Intel's Software Developer Emulator does support AVX-512, but it's unclear whether this is a cycle-exact simulator that would be suitable for benchmarking/optimization decisions. This document simultaneously suggests both that it is ("Intel SDE is useful for performance analysis, compiler development tuning, and application development of libraries.") and that it is not ("Please note that Intel SDE is a software emulator and is mainly used for emulating future instructions. It is not cycle accurate and can be very slow (up-to 100x). It is not a performance-accurate emulator."). What we need is a version of IACA that supports Knights Landing, but alas, that has not been forthcoming.


    In summary, it's nice to see that three of the most popular compilers generate high-quality, efficient code even for such a new architecture. They make slightly different decisions in which instructions to prefer, but this makes little to no practical difference.

    In many ways, we've seen that this is because of unique aspects of the Knights Landing microarchitecture. In particular, the fact that most vector instructions execute on either of two floating-point units, and that they have identical latencies and throughputs, with the implication being that there are no domain-crossing penalties you need to be concerned with and you there's no particular benefit in preferring packed-integer instructions over floating-point instructions. You can see this in the core diagram (the orange blocks on the left are the two vector units):

    Use whichever sequence of instructions you like the best.

提交回复
热议问题