Best way to store 256 bit AVX vectors into unsigned long integers

花落未央 2021-01-16 01:40

I was wondering what the best way is to store a 256-bit AVX vector into four 64-bit unsigned long integers. According to the functions written in the website https://soft

1 answer
  • 2021-01-16 02:39

    As others said in the comments, you do not need a masked store in this case. The following loop runs without errors in your program:

    for (i = 0; i < 32; i++) {
        /* The destination is written to, so the cast must not be const. */
        _mm256_storeu_si256((__m256i *)bit_out[i], v_bit[i]);
    }
    

    So the instruction you are looking for is _mm256_storeu_si256, which stores a __m256i vector to an address that does not have to be aligned. If your data is 32-byte aligned, you can use _mm256_store_si256 instead. To inspect the values in a vector, you can use this function:

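    #include <immintrin.h>
    #include <stdalign.h>
    #include <stdio.h>

    /* 32-byte aligned scratch buffer, as required by _mm256_store_si256. */
    alignas(32) unsigned long long int tempu64[4];

    void printVecu64(__m256i vec)
    {
        _mm256_store_si256((__m256i *)&tempu64[0], vec);
        /* %llu matches unsigned long long; %u would be undefined behavior. */
        printf("[0]= %llu, [1]=%llu, [2]=%llu, [3]=%llu \n\n",
               tempu64[0], tempu64[1], tempu64[2], tempu64[3]);
    }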
    

    _mm256_maskstore_epi64 lets you choose which 64-bit elements are written to memory. It is useful when some elements of the vector should be stored while the memory behind the others must stay unchanged.
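
    As a rough sketch of that selective store (not from the original question): the helper name store_odd_lanes is made up for illustration, and the target attribute assumes GCC or Clang on an x86-64 CPU with AVX2. Only lanes 1 and 3 are written; lanes 0 and 2 of the destination keep their old values.

    ```c
    #include <immintrin.h>

    /* Hypothetical helper: copy lanes 1 and 3 of src into dst and leave
       dst[0] and dst[2] untouched. Neither pointer needs to be aligned. */
    __attribute__((target("avx2")))  /* assumption: GCC/Clang, lets this build without -mavx2 */
    void store_odd_lanes(unsigned long long dst[4], const unsigned long long src[4])
    {
        __m256i vec = _mm256_loadu_si256((const __m256i *)src);

        /* A lane is stored only if the highest bit of its mask element is set.
           _mm256_set_epi64x takes the lanes in the order 3, 2, 1, 0. */
        __m256i mask = _mm256_set_epi64x(-1, 0, -1, 0);

        _mm256_maskstore_epi64((long long *)dst, mask, vec);
    }
    ```

    For example, with dst = {10, 20, 30, 40} and src = {1, 2, 3, 4}, dst ends up as {10, 2, 30, 4}.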

    I was reading the Intel 64 and IA-32 Architectures Optimization Reference Manual (248966-032, 2016), page 410, and found that unaligned stores can still be a performance killer:

    11.6.3 Prefer Aligned Stores Over Aligned Loads

    There are cases where it is possible to align only a subset of the processed data buffers. In these cases, aligning data buffers used for store operations usually yields better performance than aligning data buffers used for load operations. Unaligned stores are likely to cause greater performance degradation than unaligned loads, since there is a very high penalty on stores to a split cache-line that crosses pages. This penalty is estimated at 150 cycles. Loads that cross a page boundary are executed at retirement. In Example 11-12, unaligned store address can affect SAXPY performance for 3 unaligned addresses to about one quarter of the aligned case.

    I am sharing this here because some people said there is no difference between aligned and unaligned stores except when debugging!
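
    To make the quoted penalty a bit more concrete, here is a small sketch that tells whether a 32-byte store would split a cache line or cross a page. The helper names are hypothetical, and the 64-byte line and 4096-byte page sizes are assumptions that hold on typical x86-64 systems; they are not taken from the manual excerpt.

    ```c
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* True when the size-byte access starting at addr touches two different
       blocks of `boundary` bytes. boundary must be a power of two. */
    bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary)
    {
        return (addr & (boundary - 1)) + size > boundary;
    }

    /* Assumption: 64-byte cache lines. */
    bool store256_splits_cache_line(const void *p)
    {
        return crosses_boundary((uintptr_t)p, 32, 64);
    }

    /* Assumption: 4096-byte pages. */
    bool store256_crosses_page(const void *p)
    {
        return crosses_boundary((uintptr_t)p, 32, 4096);
    }
    ```

    Since 64 and 4096 are both multiples of 32, a 32-byte aligned address never triggers either case, which is one reason to align the buffers you store to.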
