Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2
Question: I am looking for an optimal method to calculate the sum of all packed 32-bit integers in a __m256i or __m512i. To calculate the sum of n elements, I often use log2(n) vpaddd and vpermd instructions, then extract the final result. However, I don't think that is the best option. Edit: best/optimal in terms of speed/cycle reduction.

Answer 1: (Related: if you're looking for the non-existent _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics; vpsadbw as an hsum within qwords is much