Trying to understand clang/gcc __builtin_memset on constant size / aligned pointers

Question

I am trying to understand why gcc and clang use xmm registers for __builtin_memset even when the destination address and the size are both multiples of the ymm (or zmm) register width and the CPU supports AVX2 / AVX512,

and why GCC implements __builtin_memset for medium-sized constant lengths without any SIMD at all (again assuming the CPU supports SIMD).

For example:

__builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 64);

Will compile to:

        vpcmpeqd        %xmm0, %xmm0, %xmm0
        vmovdqa %xmm0, (%rdi)
        vmovdqa %xmm0, 16(%rdi)
        vmovdqa %xmm0, 32(%rdi)
        vmovdqa %xmm0, 48(%rdi)

I am trying to understand why this is chosen as opposed to something like

        vpcmpeqd        %ymm0, %ymm0, %ymm0
        vmovdqa %ymm0, (%rdi)
        vmovdqa %ymm0, 32(%rdi)
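To make the comparison reproducible, here is a minimal sketch of both variants as standalone functions. The names fill64 and fill64_wide are hypothetical (they are not in the original post), and the second function uses the GCC/Clang vector extension rather than AVX2 intrinsics so that it compiles on any target. With AVX2 enabled (for example `-O2 -mavx2`, an assumed flag set, since the post does not state the flags used), compilers typically lower the two 32-byte assignments to two ymm stores, but the exact output depends on the compiler version and tuning options.

```c
#include <stddef.h>

/* 32-byte vector of four 64-bit lanes (GCC/Clang vector extension).
   may_alias lets us store through it into a buffer of any type. */
typedef long long v4di __attribute__((vector_size(32), may_alias));

/* Hypothetical wrapper around the call from the question:
   fill 64 bytes with 0xFF. ptr must really be 64-byte aligned. */
void fill64(void *ptr)
{
    __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 64);
}

/* Hypothetical hand-written alternative: two 32-byte stores of all-ones.
   ptr must really be 64-byte aligned. */
void fill64_wide(void *ptr)
{
    v4di *dst = __builtin_assume_aligned(ptr, 64);
    const v4di ones = { -1, -1, -1, -1 };
    dst[0] = ones;  /* bytes  0..31 */
    dst[1] = ones;  /* bytes 32..63 */
}
```

Compiling this file with `-S` and comparing the two function bodies shows whether a given compiler picks 16-byte or 32-byte stores for the builtin.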

If you mix the __builtin_memset with AVX2 instructions, the compilers still use xmm, so it's definitely not done to avoid the vzeroupper.

Second, for __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 512), GCC emits:

        movq    $-1, %rdx
        xorl    %eax, %eax
.L8:
        movl    %eax, %ecx
        addl    $32, %eax
        movq    %rdx, (%rdi,%rcx)
        movq    %rdx, 8(%rdi,%rcx)
        movq    %rdx, 16(%rdi,%rcx)
        movq    %rdx, 24(%rdi,%rcx)
        cmpl    $512, %eax
        jb      .L8
        ret

Why would gcc choose this over a loop with xmm (or ymm / zmm) registers?
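For reference, the GCC loop shown above reads roughly as follows in C. This is only a sketch: fill512 and fill512_scalar_loop are hypothetical names, and the second function is a hand transliteration of the assembly (four 8-byte stores per iteration, 32 bytes per pass), not something GCC exposes.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper around the 512-byte call from the question.
   ptr must really be 64-byte aligned. */
void fill512(void *ptr)
{
    __builtin_memset(__builtin_assume_aligned(ptr, 64), -1, 512);
}

/* Hand transliteration of the scalar loop GCC emitted. */
void fill512_scalar_loop(void *ptr)
{
    unsigned char *dst = ptr;
    const uint64_t ones = UINT64_MAX;   /* movq $-1, %rdx   */
    unsigned off = 0;                   /* xorl %eax, %eax  */

    do {
        /* memcpy of 8 bytes stands in for one movq store and
           keeps the code free of aliasing problems. */
        memcpy(dst + off,      &ones, 8);
        memcpy(dst + off + 8,  &ones, 8);
        memcpy(dst + off + 16, &ones, 8);
        memcpy(dst + off + 24, &ones, 8);
        off += 32;                      /* addl $32, %eax   */
    } while (off < 512);                /* cmpl $512 ; jb   */
}
```

Both functions leave the buffer in the same state, so the only difference under discussion is the store width and loop shape the compiler chooses.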

Here is a godbolt link with the examples (and a few others)

Thank you.

Edit: clang does use ymm (but not zmm).

Source: https://stackoverflow.com/questions/65534658/trying-to-understand-clang-gcc-builtin-memset-on-constant-size-aligned-point

Tags

gcc