Say, I want to clear 4 zmm registers.
Will the following code provide the fastest speed?
vpxorq zmm0, zmm0, zmm0
vpxorq zmm1, zmm1, zmm1
vpxorq zm
I put together a simple C test program using intrinsics and compiled it with ICC 17. The generated code I get for zeroing 4 zmm registers (at -O3) is:
vpxord %zmm3, %zmm3, %zmm3 #7.21
vmovaps %zmm3, %zmm2 #8.21
vmovaps %zmm3, %zmm1 #9.21
vmovaps %zmm3, %zmm0 #10.21
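The test program itself is not reproduced here. For anyone who wants to repeat the experiment with another compiler, a minimal sketch of what such an intrinsics-based test might look like follows. The helper name `zero_four` and the `LANES` macro are hypothetical and not taken from the original program, and the `memset` fallback is only there so the sketch still builds when AVX-512 is not enabled.

```c
#include <assert.h>
#include <string.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

#define LANES 16  /* 16 floats x 4 bytes = 64 bytes, the width of one zmm register */

/* Hypothetical helper (not from the original test program): zero four
   vector-sized values, then store them so the zeroing is observable. */
void zero_four(float out[4][LANES])
{
    assert(sizeof out[0] == 64);  /* each row is exactly one zmm-sized vector */
#if defined(__AVX512F__)
    /* Each call asks for an all-zero 512-bit vector; the compiler decides
       which instructions actually produce it. */
    __m512 a = _mm512_setzero_ps();
    __m512 b = _mm512_setzero_ps();
    __m512 c = _mm512_setzero_ps();
    __m512 d = _mm512_setzero_ps();
    _mm512_storeu_ps(out[0], a);
    _mm512_storeu_ps(out[1], b);
    _mm512_storeu_ps(out[2], c);
    _mm512_storeu_ps(out[3], d);
#else
    /* Fallback so the sketch still builds without AVX-512 support. */
    memset(out, 0, 4 * sizeof out[0]);
#endif
}
```

Building this with AVX-512 enabled (for example `-mavx512f` on GCC or Clang, or `-xCORE-AVX512` on ICC) and inspecting the assembly shows how a given compiler chooses to materialize the zeros. The result may well differ from the ICC 17 listing, and a compiler is free to zero a single register and store it four times.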