Fast(est) way to write a sequence of integers to global memory?

灰色年华 2021-02-05 22:03

The task is very simple: writing a sequence of integer values to memory:

Original code:

for (size_t i = 0; i < 1000*1000*1000; ++i)
{
    data[i] = i;
}


        
2 Answers
  • 2021-02-05 22:30

    Is there any reason why you would expect all of data[] to be in powered-up RAM pages?

    The DDR3 prefetcher will correctly predict most accesses, but the frequent x86-64 page boundaries might be an issue. You're writing to virtual memory, so at each page boundary there's a potential mis-prediction of the prefetcher. You can greatly reduce this by using large pages (e.g. MEM_LARGE_PAGES on Windows).
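
    For example, a large-page allocation on Windows might look like this minimal sketch (assuming x86-64; MEM_LARGE_PAGES needs the SeLockMemoryPrivilege enabled for the process, and the request must be a multiple of the large-page granularity):

    #include <windows.h>
    #include <cstdint>

    // Minimum large-page size, typically 2 MiB on x86-64.
    SIZE_T page = GetLargePageMinimum();

    // Round the buffer size up to a multiple of the large-page size.
    SIZE_T bytes = 1000ULL * 1000 * 1000 * sizeof(uint64_t);
    bytes = (bytes + page - 1) & ~(page - 1);

    // Returns NULL if the privilege is missing or contiguous memory is unavailable.
    uint64_t *data = static_cast<uint64_t *>(
        VirtualAlloc(nullptr, bytes,
                     MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                     PAGE_READWRITE));

    With 2 MiB pages there is one page boundary every 2 MiB of the buffer instead of every 4 KiB, so the prefetcher has far fewer boundaries to mis-predict.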

  • 2021-02-05 22:49

    Assuming this is x86, and that you are not already saturating your available DRAM bandwidth, you can try using SSE2 or AVX2 to write 2 or 4 64-bit elements at a time:

    SSE2:

    #include "emmintrin.h"
    
    const __m128i v2 = _mm_set1_epi64x(2);    // add 2 to each 64-bit lane per iteration
    __m128i v = _mm_set_epi64x(1, 0);         // lanes start as {0, 1}
    
    for (size_t i = 0; i < 1000*1000*1000; i += 2)
    {
        _mm_stream_si128((__m128i *)&data[i], v);  // non-temporal store, bypasses the cache
        v = _mm_add_epi64(v, v2);                  // advance to {i+2, i+3}
    }
    

    AVX2:

    #include "immintrin.h"
    
    const __m256i v4 = _mm256_set1_epi64x(4);    // add 4 to each 64-bit lane per iteration
    __m256i v = _mm256_set_epi64x(3, 2, 1, 0);   // lanes start as {0, 1, 2, 3}
    
    for (size_t i = 0; i < 1000*1000*1000; i += 4)
    {
        _mm256_stream_si256((__m256i *)&data[i], v);  // non-temporal store, bypasses the cache
        v = _mm256_add_epi64(v, v4);                  // advance to {i+4, ..., i+7}
    }
    

    Note that data needs to be suitably aligned (a 16-byte boundary for the SSE2 version, a 32-byte boundary for AVX2); one way to get that is sketched below.
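
    A minimal sketch of such an allocation, assuming a C++17 toolchain with std::aligned_alloc (on MSVC you would use _aligned_malloc/_aligned_free instead):

    #include <cstdint>
    #include <cstdlib>

    const size_t N = 1000*1000*1000;

    // 32-byte alignment satisfies both the SSE2 (16-byte) and AVX2 (32-byte) loops.
    // std::aligned_alloc requires the size to be a multiple of the alignment,
    // which N * sizeof(uint64_t) is here.
    uint64_t *data = static_cast<uint64_t *>(
        std::aligned_alloc(32, N * sizeof(uint64_t)));

    // ... run one of the loops above ...

    std::free(data);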

    AVX2 is only available on Intel Haswell and later, but SSE2 is pretty much universal these days.


    FWIW I put together a test harness with a scalar loop and the above SSE and AVX loops, compiled it with clang, and tested it on a Haswell MacBook Air (1600 MHz LPDDR3 DRAM). I got the following results:

    # sequence_scalar: t = 0.870903 s = 8.76033 GB / s
    # sequence_SSE: t = 0.429768 s = 17.7524 GB / s
    # sequence_AVX: t = 0.431182 s = 17.6941 GB / s
    

    I also tried it on a Linux desktop PC with a 3.6 GHz Haswell, compiling with gcc 4.7.2, and got the following:

    # sequence_scalar: t = 0.816692 s = 9.34183 GB / s
    # sequence_SSE: t = 0.39286 s = 19.4201 GB / s
    # sequence_AVX: t = 0.392545 s = 19.4357 GB / s
    

    So it looks like the SIMD implementations give a 2x or more improvement over 64-bit scalar code (although 256-bit SIMD doesn't seem to give any improvement over 128-bit SIMD), and that typical throughput should be a lot faster than 5 GB / s.

    My guess is that there is something wrong with the OP's system or benchmarking code that is resulting in the apparently reduced throughput.
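
    For reference, the timing side of such a harness might look something like this minimal sketch (not the exact code used for the numbers above; bench and the fill-function signatures are illustrative):

    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    // Time one fill function over n 64-bit elements and report throughput.
    template <typename Fill>
    void bench(const char *name, Fill fill, uint64_t *data, size_t n)
    {
        auto t0 = std::chrono::steady_clock::now();
        fill(data, n);    // e.g. sequence_scalar / sequence_SSE / sequence_AVX
        auto t1 = std::chrono::steady_clock::now();

        double t = std::chrono::duration<double>(t1 - t0).count();
        double rate = n * sizeof(uint64_t) / t / 1.0e9;
        printf("# %s: t = %g s = %g GB / s\n", name, t, rate);
    }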
