OpenMP offloading to Nvidia: wrong reduction

Asked by 说谎 on 2020-12-17 23:06

I am interested in offloading work to the GPU with OpenMP.

The code below gives the correct value of sum on the CPU:

//g++ -O3 -Wall foo.cpp -fopenmp
#pragma omp parallel for reduction(+:sum)
for(int i = 0 ; i < 2000000000; i++) sum += i%11;

However, when I offload the same loop to the GPU with target teams distribute, the value of sum comes out wrong:

//g++ -O3 -Wall foo.cpp -fopenmp -fno-stack-protector
#pragma omp target teams distribute parallel for reduction(+:sum)
for(int i = 0 ; i < 2000000000; i++) sum += i%11;
1 Answer
  • 2020-12-18 00:02

    The solution was to add the clause map(tofrom:sum) like this:

    //g++ -O3 -Wall foo.cpp -fopenmp -fno-stack-protector
    #pragma omp target teams distribute parallel for reduction(+:sum) map(tofrom:sum)
    for(int i = 0 ; i < 2000000000; i++) sum += i%11;
    

    This gets the correct result for sum; however, the code is still much slower than with OpenACC or with OpenMP without target.
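
    For reference, here is a complete minimal program with the fix (my reconstruction; only the pragma and loop are from the code above):

    //g++ -O3 -Wall foo.cpp -fopenmp -fno-stack-protector
    #include <stdio.h>
    
    int main (void) {
      // Note: sum wraps around the int range for these loop bounds;
      // "correct" here means the GPU result matches the CPU result.
      int sum = 0;
      #pragma omp target teams distribute parallel for reduction(+:sum) map(tofrom:sum)
      for(int i = 0 ; i < 2000000000; i++) sum += i%11;
      printf("sum %d\n", sum);
      return 0;
    }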

    Update: the solution to the speed problem was to add the simd clause. See the end of this answer for more information.


    The solution above has a lot of clauses on one line. It can be broken up like this:

    #pragma omp target data map(tofrom: sum)
    #pragma omp target teams distribute parallel for reduction(+:sum)
    for(int i = 0 ; i < 2000000000; i++) sum += i%11;
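
    One advantage of the separate target data directive is that the mapping can be reused across several target regions inside it. A sketch (sum2 is a hypothetical second accumulator):

    #pragma omp target data map(tofrom: sum, sum2)
    {
      #pragma omp target teams distribute parallel for reduction(+:sum)
      for(int i = 0 ; i < 2000000000; i++) sum += i%11;
    
      #pragma omp target teams distribute parallel for reduction(+:sum2)
      for(int i = 0 ; i < 2000000000; i++) sum2 += i%7;
    }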
    

    Another option is to use defaultmap(tofrom:scalar):

    #pragma omp target teams distribute parallel for reduction(+:sum) defaultmap(tofrom:scalar)
    

    Apparently, scalar variables are firstprivate by default in OpenMP 4.5 target regions, which is why sum was never copied back to the host without an explicit map: https://developers.redhat.com/blog/2016/03/22/what-is-new-in-openmp-4-5-3/

    defaultmap(tofrom:scalar) is convenient if you have multiple scalar values you want shared.
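
    For example, with a second hypothetical accumulator sum2, one defaultmap clause maps both scalars back to the host:

    #pragma omp target teams distribute parallel for reduction(+:sum,sum2) defaultmap(tofrom:scalar)
    for(int i = 0 ; i < 2000000000; i++) { sum += i%11; sum2 += i%7; }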


    I also implemented the reduction manually to see if I could speed it up, but I did not manage to. Here is the code anyway (I tried other optimizations as well, but none of them helped).

    #include <omp.h>
    #include <stdio.h>
    
    //g++ -O3 -Wall acc2.cpp -fopenmp -fno-stack-protector
    //sudo nvprof ./a.out
    
    // foo(a,b,c) computes a*b/c without overflowing the intermediate
    // product; it is used to split the iteration range evenly.
    #pragma omp declare target
    static inline int foo(int a, int b, int c) {
      return a > b ? (a/c)*b + (a%c)*b/c : (b/c)*a + (b%c)*a/c;
    }
    #pragma omp end declare target
    
    int main (void) {
      int nteams = 0, nthreads = 0;
    
      // Query the team/thread geometry the device actually provides.
      #pragma omp target teams map(tofrom: nteams) map(tofrom: nthreads)
      {
        nteams = omp_get_num_teams();
        #pragma omp parallel
        #pragma omp single
        nthreads = omp_get_num_threads();
      }
      int N = 2000000000;
      int sum = 0;
    
      #pragma omp target teams map(tofrom: sum)
      {
        // Each team takes a contiguous chunk [start, finish) of [0, N).
        int nteams = omp_get_num_teams();
        int iteam = omp_get_team_num();
        int start  = foo(iteam+0, N, nteams);
        int finish = foo(iteam+1, N, nteams);
        int n2 = finish - start;
        #pragma omp parallel
        {
          // Each thread sums a sub-chunk of its team's chunk privately,
          // then merges its partial sum with a single atomic update.
          int sum_team = 0;
          int ithread = omp_get_thread_num();
          int nthreads = omp_get_num_threads();
          int start2  = foo(ithread+0, n2, nthreads) + start;
          int finish2 = foo(ithread+1, n2, nthreads) + start;
          for(int i=start2; i<finish2; i++) sum_team += i%11;
          #pragma omp atomic
          sum += sum_team;
        }
      }
    
      printf("devices %d\n", omp_get_num_devices());
      printf("default device %d\n", omp_get_default_device());
      printf("initial (host) device %d\n", omp_get_initial_device());
      printf("nteams %d\n", nteams);
      printf("nthreads per team %d\n", nthreads);
      printf("total threads %d\n", nteams*nthreads);
      printf("sum %d\n", sum);
      return 0;
    }
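
    A note on the helper: foo(a, b, c) computes a*b/c while avoiding overflow in the intermediate product, by splitting the larger operand into quotient and remainder: a*b/c = (a/c)*b + (a%c)*b/c. For example, with N = 2000000000 and a hypothetical nteams = 15, the boundary for team 3 is foo(3, N, 15) = 133333333*3 + 5*3/15 = 400000000, whereas the naive 3*N/15 would overflow a 32-bit int first. Since foo(0, N, c) = 0 and foo(c, N, c) = N, the chunks exactly tile [0, N).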
    

    nvprof shows that most of the time is spent in cuCtxSynchronize. With OpenACC the time spent there is about half of that.


    I finally managed to dramatically speed up the reduction. The solution was to add the simd clause:

    #pragma omp target teams distribute parallel for simd reduction(+:sum) map(tofrom:sum)
    

    That combines six constructs and two clauses on one line. A slightly shorter solution is

    #pragma omp target map(tofrom:sum)
    #pragma omp teams distribute parallel for simd reduction(+:sum)
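
    Putting it all together, a complete version of the fast code (my reconstruction, with timing via omp_get_wtime) is:

    //g++ -O3 -Wall foo.cpp -fopenmp -fno-stack-protector
    #include <omp.h>
    #include <stdio.h>
    
    int main (void) {
      int sum = 0;
      double t0 = omp_get_wtime();
      #pragma omp target teams distribute parallel for simd reduction(+:sum) map(tofrom:sum)
      for(int i = 0 ; i < 2000000000; i++) sum += i%11;
      double t1 = omp_get_wtime();
      printf("sum %d in %f s\n", sum, t1 - t0);
      return 0;
    }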
    

    The times are

    OMP_GPU    0.25 s
    ACC        0.47 s
    OMP_CPU    0.64 s
    

    OpenMP on the GPU is now much faster than both OpenACC and OpenMP on the CPU. I don't know if OpenACC can be sped up with some additional clauses.
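
    For comparison, the OpenACC version of the loop is along these lines (a sketch; the compiler invocation, e.g. pgc++ -fast -ta=tesla, is assumed):

    #pragma acc parallel loop reduction(+:sum)
    for(int i = 0 ; i < 2000000000; i++) sum += i%11;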

    Hopefully, Ubuntu 18.04 fixes gcc-offload-nvptx so that it does not need -fno-stack-protector.
