Question
The code I'm trying to parallelize with OpenMP is a Monte Carlo simulation that boils down to something like this:
int seed = 0;
std::mt19937 rng(seed);
double result = 0.0;
int N = 1000;
#pragma omp parallel for
for(int i = 0; i < N; i++)
{
    result += rng();
}
std::cout << result << std::endl;
I want to make sure that the state of the random number generator is shared across threads, and the addition to the result is atomic.
Is there a way of replacing this code with something from thrust::omp? From the research I have done so far, it looks like thrust::omp is more of a directive to use multiple CPU threads rather than the GPU for some standard Thrust operations.
Answer 1:
Yes, it's possible to use thrust to do something similar, with (parallel) execution on the host CPU using OMP threads underneath the thrust OMP backend. Here's one example:
$ cat t535.cpp
#include <random>
#include <iostream>
#include <thrust/system/omp/execution_policy.h>
#include <thrust/system/omp/vector.h>
#include <thrust/reduce.h>
int main(int argc, char *argv[]){
unsigned N = 1;
int seed = 0;
if (argc > 1) N = atoi(argv[1]);
if (argc > 2) seed = atoi(argv[2]);
std::mt19937 rng(seed);
unsigned long result = 0;
thrust::omp::vector<unsigned long> vec(N);
thrust::generate(thrust::omp::par, vec.begin(), vec.end(), rng);
result = thrust::reduce(thrust::omp::par, vec.begin(), vec.end());
std::cout << result << std::endl;
return 0;
}
$ g++ -std=c++11 -O2 -I/usr/local/cuda/include -o t535 t535.cpp -fopenmp -lgomp
$ time ./t535 100000000
214746750809749347
real 0m0.700s
user 0m2.108s
sys 0m0.600s
$
For this test I used Fedora 20 with CUDA 6.5RC, running on a 4-core Xeon CPU (netting about a 3x speedup based on the time results). There are probably further "optimizations" that could be made to this particular code, but I think they would unnecessarily clutter the idea, and I assume that your actual application is more complicated than just summing random numbers.
Much of what I show here was lifted from the thrust direct system access page, but there are several comparable methods to access the OMP backend, depending on whether you want flexible, retargetable code or code that specifically uses the OMP backend (this example specifically targets the OMP backend).
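For instance, a retargetable variant of the same code (my sketch, not taken from the linked page) could avoid naming the OMP system in the source at all and instead pick the backend at compile time via the THRUST_DEVICE_SYSTEM macro; built with -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP it runs on OMP threads without any omp-specific types in the source:
#include <random>
#include <cstdlib>
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
int main(int argc, char *argv[]){
  unsigned N = 1;
  int seed = 0;
  if (argc > 1) N = atoi(argv[1]);
  if (argc > 2) seed = atoi(argv[2]);
  std::mt19937 rng(seed);
  // with THRUST_DEVICE_SYSTEM set to OMP, this "device" vector lives in host memory
  thrust::device_vector<unsigned long> vec(N);
  // no explicit execution policy: the calls dispatch to whichever backend was selected at compile time
  thrust::generate(vec.begin(), vec.end(), rng);
  unsigned long result = thrust::reduce(vec.begin(), vec.end());
  std::cout << result << std::endl;
  return 0;
}
An assumed build line for the OMP backend would look something like:
g++ -std=c++11 -O2 -I/usr/local/cuda/include -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -o retarget retarget.cpp -fopenmp -lgomp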
The thrust::reduce operation guarantees the "atomicity" you are looking for: it ensures that two threads never try to update a single location at the same time. However, the use of std::mt19937 in a multithreaded OMP app is outside the scope of my answer, I think. If I create an ordinary OMP app using the code you provided, I observe variability in the results, due (I think) to the std::mt19937 rng being used from multiple OMP threads at once. This is not something thrust can sort out for you.
Thrust also has random number generators, which are designed to work with it.
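For example, here is a sketch (my addition, not the answer's code) that uses thrust::default_random_engine with one subsequence per index, via a seed-and-discard pattern, so no generator state is shared between threads:
#include <cstdlib>
#include <iostream>
#include <thrust/random.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/system/omp/execution_policy.h>
// functor that draws one value per index from an independent subsequence
struct sample{
  unsigned long seed;
  sample(unsigned long s) : seed(s) {}
  unsigned long operator()(unsigned long i) const{
    thrust::default_random_engine rng(seed);
    rng.discard(i);   // skip ahead so index i gets its own value
    return rng();
  }
};
int main(int argc, char *argv[]){
  unsigned long N = 1;
  if (argc > 1) N = atol(argv[1]);
  unsigned long result = thrust::transform_reduce(thrust::omp::par,
      thrust::counting_iterator<unsigned long>(0),
      thrust::counting_iterator<unsigned long>(N),
      sample(0), 0ul, thrust::plus<unsigned long>());
  std::cout << result << std::endl;
  return 0;
}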
Source: https://stackoverflow.com/questions/25311538/thrust-equivalent-of-open-mp-code