Why is this SIMD multiplication not faster than non-SIMD multiplication?

轻奢々 2020-12-03 21:22

Let's assume that we have a function that multiplies two arrays of 1000000 doubles each. In C/C++ the function looks like this:

void mul_c(double* a, double* b)
{
    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}

(The question also showed a hand-written SSE inline-assembly version of this loop, which benchmarked no faster than the plain C version above.)
        
3 Answers
  • 2020-12-03 22:09

    There was a major bug in the timing function I used for previous benchmarks. It grossly underestimated the bandwidth without vectorization, as well as other measurements. Additionally, another problem overestimated the bandwidth due to COW on the array that was read but not written to. Finally, the maximum bandwidth I used was incorrect. I have updated my answer with the corrections and left the old answer at the end.


    Your operation is memory bandwidth bound. This means the CPU is spending most of its time waiting on slow memory reads and writes. An excellent explanation for this can be found here: Why vectorizing the loop does not have performance improvement.

    However, I have to disagree slightly with one statement in that answer.

    So regardless of how it's optimized, (vectorized, unrolled, etc...) it isn't gonna get much faster.

    In fact, vectorization, unrolling, and multiple threads can significantly increase the bandwidth even in memory bandwidth bound operations. The reason is that it is difficult to obtain the maximum memory bandwidth. A good explanation for this can be found here: https://stackoverflow.com/a/25187492/2542702.

    The rest of my answer will show how vectorization and multiple threads can get closer to the maximum memory bandwidth.

    My test system: Ubuntu 16.10, Skylake (i7-6700HQ@2.60GHz), 32 GB RAM, dual-channel DDR4-2400. The maximum bandwidth of my system is 38.4 GB/s (2 channels × 2400 MT/s × 8 bytes).

    From the code below I produce the following tables. I set the number of threads using OMP_NUM_THREADS, e.g. export OMP_NUM_THREADS=4. The efficiency is bandwidth/max_bandwidth.
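
    For reference, a typical build and run for one row of these tables might look like this (the file name mul.c is my own; the benchmark source is given further down):

    gcc -O2 -march=native -fopenmp mul.c -o mul
    export OMP_NUM_THREADS=2
    ./mul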

    -O2 -march=native -fopenmp
    Threads Efficiency
    1       59.2%
    2       76.6%
    4       74.3%
    8       70.7%
    
    -O2 -march=native -fopenmp -funroll-loops
    Threads Efficiency
    1       55.8%
    2       76.5%
    4       72.1%
    8       72.2%
    
    -O3 -march=native -fopenmp
    Threads Efficiency
    1       63.9%
    2       74.6%
    4       63.9%
    8       63.2%
    
    -O3 -march=native -fopenmp -mprefer-avx128
    Threads Efficiency
    1       67.8%
    2       76.0%
    4       63.9%
    8       63.2%
    
    -O3 -march=native -fopenmp -mprefer-avx128 -funroll-loops
    Threads Efficiency
    1       68.8%
    2       73.9%
    4       69.0%
    8       66.8%
    

    After several runs, to account for uncertainties in the measurements, I have formed the following conclusions:

    • single threaded scalar operations get more than 50% of the bandwidth.
    • two threaded scalar operations get the highest bandwidth.
    • single threaded vector operations are faster than single threaded scalar operations.
    • single threaded SSE operations are faster than single threaded AVX operations.
    • unrolling is not helpful.
    • unrolling single-threaded operations is slower than without unrolling.
    • more threads than cores (Hyper-Threading) gives a lower bandwidth.

    The solution that gives the best bandwidth is scalar operations with two threads.

    The code I used to benchmark:

    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>
    #include <omp.h>
    
    #define N 10000000
    #define R 100
    
    void mul(double *a, double *b) {
      #pragma omp parallel for
      for (int i = 0; i<N; i++) a[i] *= b[i];
    }
    
    int main() {
      double maxbw = 2.4*2*8; // 2.4 GT/s * 2 channels * 8 bytes per transfer = 38.4 GB/s
      double mem = 3*sizeof(double)*N*R*1E-9; // GB
    
      double *a = (double*)malloc(sizeof *a * N);
      double *b = (double*)malloc(sizeof *b * N);
    
      //due to copy-on-write b must be initialized to get the correct bandwidth
      //also, GCC will convert malloc + memset(0) to calloc so use memset(1)
      memset(b, 1, sizeof *b * N);
    
      double dtime = -omp_get_wtime();
      for(int i=0; i<R; i++) mul(a,b);
      dtime += omp_get_wtime();
      printf("%.2f s, %.1f GB/s, %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);
    
      free(a), free(b);
    }
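
    A note on the mem formula above: each call to mul reads a, reads b, and writes a back, so the traffic per call is 3 × 8 bytes × N. Over R repetitions that is 3 × 8 B × 10^7 × 100 ≈ 24 GB, which is what mem holds (in GB).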
    

    The old solution with the timing bug

    The modern replacement for inline assembly is intrinsics. There are still cases where one needs inline assembly, but this is not one of them.

    An intrinsics equivalent of your inline-assembly approach is simply:

    void mul_SSE(double*  a, double*  b) {
      for (int i = 0; i<N/2; i++) 
          _mm_store_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
    }
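
    For comparison (my addition, not part of the original answer), the equivalent AVX version processes four doubles per iteration and requires the pointers to be 32-byte aligned:

    void mul_AVX(double*  a, double*  b) {
      for (int i = 0; i<N/4; i++)
          _mm256_store_pd(&a[4*i], _mm256_mul_pd(_mm256_load_pd(&a[4*i]),_mm256_load_pd(&b[4*i])));
    }

    It needs to be compiled with -mavx (or -march=native on an AVX-capable CPU); the intrinsics are declared in <immintrin.h>, which the <x86intrin.h> header used below already pulls in.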
    

    Let me define some test code

    #include <x86intrin.h>
    #include <string.h>
    #include <stdio.h>
    #include <omp.h>
    
    #define N 1000000
    #define R 1000
    
    typedef __attribute__(( aligned(32)))  double aligned_double;
    void  (*fp)(aligned_double *a, aligned_double *b);
    
    void mul(aligned_double* __restrict a, aligned_double* __restrict b) {
      for (int i = 0; i<N; i++) a[i] *= b[i];
    }
    
    void mul_SSE(double*  a, double*  b) {
      for (int i = 0; i<N/2; i++) _mm_store_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
    }
    
    void mul_SSE_NT(double*  a, double*  b) {
      for (int i = 0; i<N/2; i++) _mm_stream_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
    }
    
    void mul_SSE_OMP(double*  a, double*  b) {
      #pragma omp parallel for
      for (int i = 0; i<N; i++) a[i] *= b[i];
    }
    
    void test(aligned_double *a, aligned_double *b, const char *name) {
      double dtime;
      const double mem = 3*sizeof(double)*N*R/1024/1024/1024;
      const double maxbw = 34.1;
      dtime = -omp_get_wtime();
      for(int i=0; i<R; i++) fp(a,b);
      dtime += omp_get_wtime();
      printf("%s \t time %.2f s, %.1f GB/s, efficency %.1f%%\n", name, dtime, mem/dtime, 100*mem/dtime/maxbw);
    }
    
    int main() {
      double *a = (double*)_mm_malloc(sizeof *a * N, 32);
      double *b = (double*)_mm_malloc(sizeof *b * N, 32);
    
      //b must be initialized to get the correct bandwidth!!!
      memset(a, 1, sizeof *a * N);
      memset(b, 1, sizeof *b * N);
    
      fp = mul,         test(a,b, "mul        ");
      fp = mul_SSE,     test(a,b, "mul_SSE    ");
      fp = mul_SSE_NT,  test(a,b, "mul_SSE_NT ");
      fp = mul_SSE_OMP, test(a,b, "mul_SSE_OMP");
    
      _mm_free(a), _mm_free(b);
    }
    

    Now the first test

    g++ -O2 -fopenmp test.cpp
    ./a.out
    mul              time 1.67 s, 13.1 GB/s, efficiency 38.5%
    mul_SSE          time 1.00 s, 21.9 GB/s, efficiency 64.3%
    mul_SSE_NT       time 1.05 s, 20.9 GB/s, efficiency 61.4%
    mul_SSE_OMP      time 0.74 s, 29.7 GB/s, efficiency 87.0%
    

    So with -O2, which does not vectorize loops, we see that the intrinsic SSE version is much faster than the plain C solution mul. Here efficiency = bandwidth_measured/max_bandwidth, where the max is 34.1 GB/s for my system.

    Second test

    g++ -O3 -fopenmp test.cpp
    ./a.out
    mul              time 1.05 s, 20.9 GB/s, efficiency 61.2%
    mul_SSE          time 0.99 s, 22.3 GB/s, efficiency 65.3%
    mul_SSE_NT       time 1.01 s, 21.7 GB/s, efficiency 63.7%
    mul_SSE_OMP      time 0.68 s, 32.5 GB/s, efficiency 95.2%
    

    With -O3, which vectorizes the loop, the intrinsic function offers essentially no advantage.
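
    One way to confirm that the loop was auto-vectorized (my suggestion, not part of the original answer) is to ask GCC for its vectorization report:

    g++ -O3 -fopenmp -fopt-info-vec test.cpp

    GCC then prints a note for each loop it managed to vectorize, so you can check mul without reading the assembly.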

    Third test

    g++ -O3 -fopenmp -funroll-loops test.cpp
    ./a.out
    mul              time 0.85 s, 25.9 GB/s, efficiency 76.1%
    mul_SSE          time 0.84 s, 26.2 GB/s, efficiency 76.7%
    mul_SSE_NT       time 1.06 s, 20.8 GB/s, efficiency 61.0%
    mul_SSE_OMP      time 0.76 s, 29.0 GB/s, efficiency 85.0%
    

    With -funroll-loops GCC unrolls the loop eight times and we see a significant improvement, except for the non-temporal store solution, and no real advantage for the OpenMP solution.

    Before unrolling the loop, the assembly for mul with -O3 is

        xor     eax, eax
    .L2:
        movupd  xmm0, XMMWORD PTR [rsi+rax]
        mulpd   xmm0, XMMWORD PTR [rdi+rax]
        movaps  XMMWORD PTR [rdi+rax], xmm0
        add     rax, 16
        cmp     rax, 8000000
        jne     .L2
        rep ret
    

    With -O3 -funroll-loops the assembly for mul is:

        xor     eax, eax
    .L2:
        movupd  xmm0, XMMWORD PTR [rsi+rax]
        movupd  xmm1, XMMWORD PTR [rsi+16+rax]
        mulpd   xmm0, XMMWORD PTR [rdi+rax]
        movupd  xmm2, XMMWORD PTR [rsi+32+rax]
        mulpd   xmm1, XMMWORD PTR [rdi+16+rax]
        movupd  xmm3, XMMWORD PTR [rsi+48+rax]
        mulpd   xmm2, XMMWORD PTR [rdi+32+rax]
        movupd  xmm4, XMMWORD PTR [rsi+64+rax]
        mulpd   xmm3, XMMWORD PTR [rdi+48+rax]
        movupd  xmm5, XMMWORD PTR [rsi+80+rax]
        mulpd   xmm4, XMMWORD PTR [rdi+64+rax]
        movupd  xmm6, XMMWORD PTR [rsi+96+rax]
        mulpd   xmm5, XMMWORD PTR [rdi+80+rax]
        movupd  xmm7, XMMWORD PTR [rsi+112+rax]
        mulpd   xmm6, XMMWORD PTR [rdi+96+rax]
        movaps  XMMWORD PTR [rdi+rax], xmm0
        mulpd   xmm7, XMMWORD PTR [rdi+112+rax]
        movaps  XMMWORD PTR [rdi+16+rax], xmm1
        movaps  XMMWORD PTR [rdi+32+rax], xmm2
        movaps  XMMWORD PTR [rdi+48+rax], xmm3
        movaps  XMMWORD PTR [rdi+64+rax], xmm4
        movaps  XMMWORD PTR [rdi+80+rax], xmm5
        movaps  XMMWORD PTR [rdi+96+rax], xmm6
        movaps  XMMWORD PTR [rdi+112+rax], xmm7
        sub     rax, -128
        cmp     rax, 8000000
        jne     .L2
        rep ret
    

    Fourth test

    g++ -O3 -fopenmp -mavx test.cpp
    ./a.out
    mul              time 0.87 s, 25.3 GB/s, efficiency 74.3%
    mul_SSE          time 0.88 s, 24.9 GB/s, efficiency 73.0%
    mul_SSE_NT       time 1.07 s, 20.6 GB/s, efficiency 60.5%
    mul_SSE_OMP      time 0.76 s, 29.0 GB/s, efficiency 85.2%
    

    Now the non-intrinsic function is the fastest (excluding the OpenMP version).

    So there is no reason to use intrinsics or inline assembly in this case because we can get the best performance with appropriate compiler options (e.g. -O3, -funroll-loops, -mavx).

    Test system: Ubuntu 16.10, Skylake (i7-6700HQ@2.60GHz), 32GB RAM. Maximum memory bandwidth (34.1 GB/s) https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3_50-GHz


    Here is another solution worth considering. The cmp instruction is not necessary if we count from -N up to zero and access the arrays as N+i. GCC should have fixed this a long time ago. It eliminates one instruction (though due to macro-op fusion the cmp and jmp often count as one micro-op).

    void mul_SSE_v2(double*  a, double*  b) {
      for (ptrdiff_t i = -N; i<0; i+=2)
        _mm_store_pd(&a[N + i], _mm_mul_pd(_mm_load_pd(&a[N + i]),_mm_load_pd(&b[N + i])));
    }

    Assembly with -O3

    mul_SSE_v2(double*, double*):
        mov     rax, -1000000
    .L9:
        movapd  xmm0, XMMWORD PTR [rdi+8000000+rax*8]
        mulpd   xmm0, XMMWORD PTR [rsi+8000000+rax*8]
        movaps  XMMWORD PTR [rdi+8000000+rax*8], xmm0
        add     rax, 2
        jne     .L9
        rep ret
    

    This optimization is only likely to be helpful when the arrays fit in e.g. the L1 cache, i.e. when not reading from main memory.


    I finally found a way to get the plain C solution to not generate the cmp instruction.

    void mul_v2(aligned_double* __restrict a, aligned_double* __restrict b) {
      for (int i = -N; i<0; i++) a[i] *= b[i];
    }
    

    And then call the function from a separate object file like this: mul_v2(&a[N], &b[N]). This is perhaps the best solution. However, if you call the function from the same object file (translation unit) as the one it is defined in, GCC generates the cmp instruction again.
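
    A minimal sketch of the call site in the other translation unit (my example, not from the original answer; N is the same macro as above):

    // main.c -- a different translation unit from the one defining mul_v2
    #define N 1000000
    void mul_v2(double* __restrict a, double* __restrict b);  // defined elsewhere

    void run(double *a, double *b) {
      // a and b point to the start of the arrays; pass one-past-the-end
      // pointers so the indices inside mul_v2 run from -N to -1.
      mul_v2(&a[N], &b[N]);
    }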

    Also,

    void mul_v3(aligned_double* __restrict a, aligned_double* __restrict b) {
      for (int i = -N; i<0; i++) a[N+i] *= b[N+i];
    }
    

    still generates the cmp instruction and produces the same assembly as the mul function.


    The function mul_SSE_NT is silly. It uses non-temporal stores, which are only useful when memory is only written to; since this function reads and writes the same addresses, non-temporal stores are not just useless, they give inferior results.
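
    As a hypothetical illustration of where non-temporal stores do pay off (my sketch, not part of the original answer), consider a kernel that writes its result to a third array that will not be read again soon, so there is no point in caching it:

    void mul_into_NT(double* c, const double* a, const double* b) {
      // c is only written, never read, so streaming stores avoid polluting the cache
      for (int i = 0; i<N/2; i++)
          _mm_stream_pd(&c[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
      _mm_sfence();  // make the streaming stores globally visible before c is read elsewhere
    }

    Like the destination of _mm_store_pd, c must be 16-byte aligned here.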


    Previous versions of this answer were getting the wrong bandwidth. The reason was that the arrays were not initialized.

  • 2020-12-03 22:18

    I want to add another point of view on the problem. SIMD instructions give a big performance boost when there are no memory-bandwidth restrictions. But in the current example there are too many memory load and store operations and too few CPU calculations, so the CPU can keep up with the incoming data even without SIMD. If you use data of another type (32-bit float, for example) or a more complex algorithm, memory throughput won't restrict CPU performance and SIMD will give more of an advantage.
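
    For instance (a sketch of my own, not from the original answer), a kernel that does several arithmetic operations per element loaded is far more likely to benefit from vectorization, because arithmetic rather than memory traffic becomes the bottleneck:

    // Hypothetical compute-heavy kernel: about 6 floating-point operations per
    // 4-byte element loaded, evaluated with Horner's scheme.
    void poly_eval(float* y, const float* x, int n) {
      for (int i = 0; i<n; i++) {
        float t = x[i];
        y[i] = ((2.0f*t + 3.0f)*t + 5.0f)*t + 7.0f;
      }
    }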

  • 2020-12-03 22:27

    Your asm code is really OK. What is not OK is the way you measure it. As I pointed out in the comments, you should:

    a) use way more iterations - 1 million is nothing for a modern CPU

    b) use a high-precision timer (HPT) for the measurement

    c) use RDTSC or RDTSCP to count actual CPU clock cycles (see the sketch after this list)
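
    A minimal sketch of TSC-based timing (my addition; __rdtsc() is the GCC/Clang compiler intrinsic for the RDTSC instruction, declared in <x86intrin.h>):

    #include <x86intrin.h>

    // returns the number of reference cycles taken by 1000 calls to mul_c
    unsigned long long time_mul_c(double* a, double* b) {
      unsigned long long t0 = __rdtsc();
      for (int r = 0; r < 1000; ++r) mul_c(a, b);  // repeat to amortize timing overhead
      return __rdtsc() - t0;
    }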

    Additionally, why are you afraid of -O3? Don't forget to build the code for your platform, so use -march=native. If your CPU supports AVX or AVX2, the compiler will take the opportunity to produce even better code.
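
    For example (my own invocation, with a hypothetical file name):

    gcc -O3 -march=native mul.c -o mul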

    Next thing - give the compiler some hints about aliasing and alignment if you know your code.

    Here is my version of your mul_c - yes, it is GCC-specific, but you showed that you used GCC:

    void mul_c(double* restrict a, double* restrict b)
    {
       a = __builtin_assume_aligned (a, 16);
       b = __builtin_assume_aligned (b, 16);
    
        for (int i = 0; i != 1000000; ++i)
        {
            a[i] = a[i] * b[i];
        }
    }
    

    It will produce:

    mul_c(double*, double*):
            xor     eax, eax
    .L2:
            movapd  xmm0, XMMWORD PTR [rdi+rax]
            mulpd   xmm0, XMMWORD PTR [rsi+rax]
            movaps  XMMWORD PTR [rdi+rax], xmm0
            add     rax, 16
            cmp     rax, 8000000
            jne     .L2
            rep ret
    

    If you have AVX2 and make sure the data is 32-byte aligned, it will become:

    mul_c(double*, double*):
            xor     eax, eax
    .L2:
            vmovapd ymm0, YMMWORD PTR [rdi+rax]
            vmulpd  ymm0, ymm0, YMMWORD PTR [rsi+rax]
            vmovapd YMMWORD PTR [rdi+rax], ymm0
            add     rax, 32
            cmp     rax, 8000000
            jne     .L2
            vzeroupper
            ret
    

    So there is no need for handcrafted asm if the compiler can do it for you ;)
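
    A sketch of how one might actually obtain and promise that 32-byte alignment (my example; aligned_alloc is C11, and the size must be a multiple of the alignment):

    #include <stdlib.h>

    void mul_c_avx(double* restrict a, double* restrict b)
    {
       a = __builtin_assume_aligned (a, 32);
       b = __builtin_assume_aligned (b, 32);

        for (int i = 0; i != 1000000; ++i)
            a[i] = a[i] * b[i];
    }

    // allocation that guarantees the promised alignment:
    // double* a = aligned_alloc(32, sizeof(double) * 1000000);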
