Question
I am trying to understand why both gcc and clang use xmm registers for their __builtin_memset even when the destination address and the size are both divisible by sizeof ymm (or zmm, for that matter) and the CPU supports AVX2 / AVX512. I would also like to know why GCC implements __builtin_memset for medium-sized values without any SIMD (again assuming the CPU supports SIMD).
For example:
__builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 64);
Will compile to:
vpcmpeqd %xmm0, %xmm0, %xmm0
vmovdqa %xmm0, (%rdi)
vmovdqa %xmm0, 16(%rdi)
vmovdqa %xmm0, 32(%rdi)
vmovdqa %xmm0, 48(%rdi)
I am trying to understand why this is chosen as opposed to something like:
vpcmpeqd %ymm0, %ymm0, %ymm0
vmovdqa %ymm0, (%rdi)
vmovdqa %ymm0, 32(%rdi)
If you mix the __builtin_memset with AVX2 instructions, the compilers still use xmm, so it is definitely not done to save the vzeroupper.
Second, for __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 512), GCC generates:
movq $-1, %rdx
xorl %eax, %eax
.L8:
movl %eax, %ecx
addl $32, %eax
movq %rdx, (%rdi,%rcx)
movq %rdx, 8(%rdi,%rcx)
movq %rdx, 16(%rdi,%rcx)
movq %rdx, 24(%rdi,%rcx)
cmpl $512, %eax
jb .L8
ret
Why would gcc choose this over a loop with xmm (or ymm / zmm) registers?
Here is a godbolt link with the examples (and a few others)
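For reference, here is a minimal sketch of the two cases as standalone C functions, assuming GCC or clang (both builtins and the aligned attribute are compiler extensions). The names fill64 and fill512, and the helpers all_ones and verify_fills, are hypothetical and do not appear in the original examples. They only wrap the calls so that the generated assembly can be inspected and the result checked.

```c
#include <stddef.h>

/* 64-byte case from the question: the pointer is promised to be 64-byte aligned. */
void fill64(void *ptr) {
    __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 64);
}

/* 512-byte case from the question, same alignment promise. */
void fill512(void *ptr) {
    __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 512);
}

/* Hypothetical helper: returns 1 if the first n bytes are all 0xFF, else 0. */
int all_ones(const unsigned char *p, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (p[i] != 0xFF) return 0;
    }
    return 1;
}

/* Hypothetical self-check: runs both fills on a 64-byte-aligned static buffer. */
int verify_fills(void) {
    static unsigned char buf[512] __attribute__((aligned(64)));
    for (size_t i = 0; i < sizeof buf; i++) buf[i] = 0;

    fill64(buf);
    /* Exactly the first 64 bytes should be set; byte 64 must be untouched. */
    if (!all_ones(buf, 64) || buf[64] != 0) return 0;

    fill512(buf);
    return all_ones(buf, sizeof buf);
}
```

To look at the code generated for fill64 and fill512, something like gcc -O2 -march=haswell -S (or -march=skylake-avx512 for the AVX512 case) should work. These flags are an assumption, since the question does not state which options were used.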
Thank you.
Edit: clang uses ymm (but not zmm)
Source: https://stackoverflow.com/questions/65534658/trying-to-understand-clang-gcc-builtin-memset-on-constant-size-aligned-point