I am interested in offloading work to the GPU with OpenMP.
The code below gives the correct value of sum
on the CPU
//g++ -O3 -Wall foo
The solution was to add the clause map(tofrom:sum)
like this:
//g++ -O3 -Wall foo.cpp -fopenmp -fno-stack-protector
#pragma omp target teams distribute parallel for reduction(+:sum) map(tofrom:sum)
for(int i = 0 ; i < 2000000000; i++) sum += i%11;
This gets the correct result for sum
however the code is still much slower than with OpenACC or OpenMP without target
.
Update: the solution to the speed was to add the simd
clause. See the end of this answer for more information.
The solution above has a lot of clauses on one line. It can be broken up like this:
#pragma omp target data map(tofrom: sum)
#pragma omp target teams distribute parallel for reduction(+:sum)
for(int i = 0 ; i < 2000000000; i++) sum += i%11;
Another option is to use defaultmap(tofrom:scalar)
#pragma omp target teams distribute parallel for reduction(+:sum) defaultmap(tofrom:scalar)
Apparently, scalar variables in OpenMP 4.5 are firstprivate
by default.
https://developers.redhat.com/blog/2016/03/22/what-is-new-in-openmp-4-5-3/
defaultmap(tofrom:scalar)
is convenient if you have multiple scalar values you want shared.
I also implemented the reduction manually to see if I could speed it up. I have not managed to speed it up but here is the code anyway (there are other optimizations I have tried but none of them have helped).
#include <omp.h>
#include <stdio.h>
//g++ -O3 -Wall acc2.cpp -fopenmp -fno-stack-protector
//sudo nvprof ./a.out
static inline int foo(int a, int b, int c) {
return a > b ? (a/c)*b + (a%c)*b/c : (b/c)*a + (b%c)*a/c;
}
int main (void) {
int nteams = 0, nthreads = 0;
#pragma omp target teams map(tofrom: nteams) map(tofrom:nthreads)
{
nteams = omp_get_num_teams();
#pragma omp parallel
#pragma omp single
nthreads = omp_get_num_threads();
}
int N = 2000000000;
int sum = 0;
#pragma omp declare target(foo)
#pragma omp target teams map(tofrom: sum)
{
int nteams = omp_get_num_teams();
int iteam = omp_get_team_num();
int start = foo(iteam+0, N, nteams);
int finish = foo(iteam+1, N, nteams);
int n2 = finish - start;
#pragma omp parallel
{
int sum_team = 0;
int ithread = omp_get_thread_num();
int nthreads = omp_get_num_threads();
int start2 = foo(ithread+0, n2, nthreads) + start;
int finish2 = foo(ithread+1, n2, nthreads) + start;
for(int i=start2; i<finish2; i++) sum_team += i%11;
#pragma omp atomic
sum += sum_team;
}
}
printf("devices %d\n", omp_get_num_devices());
printf("default device %d\n", omp_get_default_device());
printf("device id %d\n", omp_get_initial_device());
printf("nteams %d\n", nteams);
printf("nthreads per team %d\n", nthreads);
printf("total threads %d\n", nteams*nthreads);
printf("sum %d\n", sum);
return 0;
}
nvprof
shows that most of the time is spend with cuCtxSynchronize
. With OpenACC it's about half of that.
I finally managed to dramatically speed up the reduction. The solution was to add the simd
clause
#pragma omp target teams distribute parallel for simd reduction(+:sum) map(tofrom:sum).
That's nine clauses on one line. A slightly shorter solution is
#pragma omp target map(tofrom:sum)
#pragma omp teams distribute parallel for simd reduction(+:sum)
The times are
OMP_GPU 0.25 s
ACC 0.47 s
OMP_CPU 0.64 s
OpenMP on the GPU now is much faster than OpenACC and OpenMP on the CPU . I don't know if OpenACC can be sped up with with some additional clauses.
Hopefully, Ubuntu 18.04 fixes gcc-offload-nvptx
so that it does not need -fno-stack-protector
.