问题
Intel's intrinsic guide lists the intrinsic _mm256_loadu_epi32:
_m256i _mm256_loadu_epi32 (void const* mem_addr);
/*
Instruction: vmovdqu32 ymm, m256
CPUID Flags: AVX512VL + AVX512F
Description
Load 256-bits (composed of 8 packed 32-bit integers) from memory into dst.
mem_addr does not need to be aligned on any particular boundary.
Operation
a[255:0] := MEM[mem_addr+255:mem_addr]
dst[MAX:256] := 0
*/
But clang and gcc do not provide this intrinsic. Instead they provide (in file avx512vlintrin.h
) only the masked versions
_mm256_mask_loadu_epi32 (__m256i, __mmask8, void const *);
_mm256_maskz_loadu_epi32 (__mmask8, void const *);
which boil down to the same instruction vmovdqu32
. My question: how can I emulate _mm256_loadu_epi32
:
inline _m256i _mm256_loadu_epi32(void const* mem_addr)
{
/* code using vmovdqu32 and compiles with gcc */
}
without writing assembly, i.e. using only intrinsics available?
回答1:
Just use _mm256_loadu_si256
like a normal person. The only thing the AVX512 intrinsic gives you is a nicer prototype (const void*
instead of const __m256i*
).
@chtz suggests out that you might still want to write a wrapper function yourself to get the void*
prototype. But don't call it _mm256_loadu_epi32
; some future GCC version will probably add that for compat with Intel's docs and break your code.
You don't even want the compiler to emit vmovdqu32 ymm
when you're not masking; vmovdqu ymm
is shorter and does exactly the same thing, with no penalty for mixing with EVEX-encoded instructions. The compiler can always use an vmovdqu32
or 64
if it wants to load into ymm16..31, otherwise you want it to use a shorter VEX-coded AVX1 vmovdqu
.
I'm pretty sure that GCC treats _mm256_maskz_epi32(0xffu,ptr)
exactly the same as _mm256_loadu_si256((const __m256i*)ptr)
and makes the same asm regardless of which one you use. It can optimize away the 0xffu
mask and simply use an unmasked load, but there's no need for that extra complication in your source.
But unfortunately current GCC will pessimize to vmovdqu32 ymm0, [mem]
when AVX512VL is enabled (e.g. -march=skylake-avx512
) even when you write _mm256_loadu_si256
. This is a missed-optimization, GCC Bug 89346.
It doesn't matter which 256-bit load intrinsic you use (except for aligned vs. unaligned) as long as there's no masking.
Related:
- error: '_mm512_loadu_epi64' was not declared in this scope
- What is the difference between _mm512_load_epi32 and _mm512_load_si512?
来源:https://stackoverflow.com/questions/59649287/how-to-emulate-mm256-loadu-epi32-with-gcc-or-clang