Unexpectedly good performance with openmp parallel for loop

悲&欢浪女 2020-12-06 17:32

I have edited my question after previous comments (especially @Zboson) for better readability

I have always acted on, and observed, the conventional

1 Answer
  • 2020-12-06 18:08

    The problem is most likely the clock() function. On Linux it does not return wall time; it returns the CPU time used by the whole process, so the time spent in every thread is added together. Use omp_get_wtime() instead. It is more accurate than clock() and works with GCC, ICC, and MSVC. In fact, I use it for timing code even when I'm not using OpenMP.

    I tested your code with it here http://coliru.stacked-crooked.com/a/26f4e8c9fdae5cc2

    Edit: Another thing that may be causing your problem is that the exp and sin functions you are calling are compiled WITHOUT AVX support, while your own code is compiled with AVX support (actually AVX2). You can see this in GCC Explorer if you compile your code with -fopenmp -mavx2 -mfma. Whenever you call a function without AVX support from code that uses AVX, you need to zero the upper half of the YMM registers or pay a large penalty. You can do this with the intrinsic _mm256_zeroupper (VZEROUPPER). Clang does this for you, but last I checked GCC does not, so you have to do it yourself (see the comments on the question titled Math functions takes more cycles after running any intel AVX function, and also the answer to Using AVX CPU instructions: Poor performance without "/arch:AVX").

    So on every iteration you pay a large delay for not calling VZEROUPPER. I'm not sure why this matters with multiple threads, but if GCC does this each time it starts a new thread then it could help explain what you are seeing.

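    #include <immintrin.h>  // _mm256_zeroupper
    #include <math.h>       // sin, exp

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        _mm256_zeroupper();  // clear the upper YMM halves before the non-AVX sin call
        B[i] = sin(B[i]);
        _mm256_zeroupper();  // and again before the non-AVX exp call
        B[i] += exp(A[i]);
    }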
    

    Edit: A simpler way to test this is, instead of compiling with -march=native, to not set the arch at all (gcc -Ofast -std=c99 -fopenmp -Wa) or to use only SSE2 (gcc -Ofast -msse2 -std=c99 -fopenmp -Wa).
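
    When switching between these builds, it can help to confirm which instruction set the calling code was actually compiled for. The sketch below relies on the predefined macros __AVX2__, __AVX__, and __SSE2__ that GCC and Clang set according to the -m and -march flags; the function name simd_level is hypothetical.

```c
#include <stdio.h>

/* Reports the highest SIMD level enabled for this translation unit,
   based on compiler-predefined macros (GCC and Clang set these from
   -march / -mavx2 / -mavx / -msse2). */
static const char *simd_level(void) {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none detected";
#endif
}
```

    Printing simd_level() at program start, for example with printf("%s\n", simd_level()), shows whether a given build is the AVX one or the SSE2 one before you compare timings.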

    Edit: GCC 4.8 has an option, -mvzeroupper, which may be the most convenient solution.

    This option instructs GCC to emit a vzeroupper instruction before a transfer of control flow out of the function to minimize the AVX to SSE transition penalty as well as remove unnecessary zeroupper intrinsics.
