OpenMP offloading to Nvidia wrong reduction

I am interested in offloading work to the GPU with OpenMP.

The code below gives the correct value of sum on the CPU

//g++ -O3 -Wall foo


                      
1 answer

夕颜, 2020-12-18 00:02

The solution was to add the clause map(tofrom:sum) like this:

//g++ -O3 -Wall foo.cpp -fopenmp -fno-stack-protector
#pragma omp target teams distribute parallel for reduction(+:sum) map(tofrom:sum)
for(int i = 0 ; i < 2000000000; i++) sum += i%11;


This gets the correct result for sum; however, the code is still much slower than with OpenACC or with OpenMP without target.

Update: the solution to the speed was to add the simd clause. See the end of this answer for more information.



The solution above has a lot of clauses on one line. It can be broken up like this:

#pragma omp target data map(tofrom: sum)
#pragma omp target teams distribute parallel for reduction(+:sum)
for(int i = 0 ; i < 2000000000; i++) sum += i%11;




Another option is to use defaultmap(tofrom:scalar)

#pragma omp target teams distribute parallel for reduction(+:sum) defaultmap(tofrom:scalar)


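Apparently, scalar variables referenced in a target region are firstprivate by default in OpenMP 4.5, so without an explicit map clause the value of sum computed on the device is never copied back to the host.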
https://developers.redhat.com/blog/2016/03/22/what-is-new-in-openmp-4-5-3/

defaultmap(tofrom:scalar) is convenient if you have multiple scalar values you want shared.



I also implemented the reduction manually to see if I could speed it up.  I have not managed to speed it up but here is the code anyway (there are other optimizations I have tried but none of them have helped).

#include <omp.h>
#include <stdio.h>

//g++ -O3 -Wall acc2.cpp -fopenmp -fno-stack-protector
//sudo nvprof ./a.out

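// Computes floor(a*b/c) for non-negative inputs without forming the full
// a*b product, which could overflow int. Used to split a range into chunks.
static inline int foo(int a, int b, int c) {
  return a > b ? (a/c)*b + (a%c)*b/c : (b/c)*a + (b%c)*a/c;
}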

int main (void) {
  int nteams = 0, nthreads = 0;

  #pragma omp target teams map(tofrom: nteams) map(tofrom:nthreads)
  {
    nteams = omp_get_num_teams();
    #pragma omp parallel
    #pragma omp single
    nthreads = omp_get_num_threads();
  }
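  int N = 2000000000;
  // Caution: the exact total for this N is 9999999991, which does not fit in
  // a 32-bit int; a 64-bit accumulator (long long) would avoid the overflow.
  int sum = 0;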

  #pragma omp declare target(foo)  

  #pragma omp target teams map(tofrom: sum)
  {
    int nteams = omp_get_num_teams();
    int iteam = omp_get_team_num();
    int start  = foo(iteam+0, N, nteams);
    int finish = foo(iteam+1, N, nteams);    
    int n2 = finish - start;
    #pragma omp parallel
    {
      int sum_team = 0;
      int ithread = omp_get_thread_num();
      int nthreads = omp_get_num_threads();
      int start2  = foo(ithread+0, n2, nthreads) + start;
      int finish2 = foo(ithread+1, n2, nthreads) + start;
      for(int i=start2; i<finish2; i++) sum_team += i%11;
      #pragma omp atomic
      sum += sum_team;
    }   
  }   

  printf("devices %d\n", omp_get_num_devices());
  printf("default device %d\n", omp_get_default_device());
  printf("initial (host) device %d\n", omp_get_initial_device());
  printf("nteams %d\n", nteams);
  printf("nthreads per team %d\n", nthreads);
  printf("total threads %d\n", nteams*nthreads);
  printf("sum %d\n", sum);
  return 0;
}


nvprof shows that most of the time is spent in cuCtxSynchronize. With OpenACC it's about half of that.



I finally managed to dramatically speed up the reduction. The solution was to add simd to the combined construct:

#pragma omp target teams distribute parallel for simd reduction(+:sum) map(tofrom:sum)


That's a lot of constructs and clauses on one line. A slightly shorter solution is

#pragma omp target map(tofrom:sum)
#pragma omp teams distribute parallel for simd reduction(+:sum)


The times are

OMP_GPU    0.25 s
ACC        0.47 s
OMP_CPU    0.64 s


OpenMP on the GPU is now roughly twice as fast as OpenACC and about 2.5 times as fast as OpenMP on the CPU. I don't know if OpenACC can be sped up with some additional clauses.

Hopefully, Ubuntu 18.04 fixes gcc-offload-nvptx so that it does not need -fno-stack-protector.
                                                                        
                                                        
            
            
              
                