I'm trying to make a for loop multi-threaded in C++ so that the calculation gets divided among multiple threads. Yet it contains data that needs to be joined together in the end.
If you really need to preserve the same order of operations as in the serial case, then there is no way around doing it serially. In that case you can instead try to parallelize the operations done inside operator+=.
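As a rough sketch, assuming myData wraps a contiguous container such as a std::vector<double> (that layout is an assumption here, since the actual definition isn't shown), the element-wise part of operator+= could itself be parallelized:

#include <vector>

struct myData {
    std::vector<double> values;

    // Assumes both operands already hold the same number of elements.
    myData& operator+=(const myData& other) {
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(values.size()); ++i)
            values[i] += other.values[i]; // element-wise accumulation done in parallel
        return *this;
    }
};

This only pays off if a single += is expensive enough to amortize the overhead of the parallel region.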
If the operations can be done in any order, but the reduction of the blocks has a specific order, then it may be worth having a look at TBB parallel_reduce. It will require you to write more code, but if I remember correctly it lets you define complex custom reduction operations.
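For illustration, a sketch of what that could look like with the functional form of tbb::parallel_reduce, assuming myData is default-constructible and its operator+= is associative (the lambdas are mine, not code from your project):

#include <cstddef>
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

myData result = tbb::parallel_reduce(
    tbb::blocked_range<std::size_t>(0, ids.size()),
    myData(),                                              // identity element
    [&](const tbb::blocked_range<std::size_t>& r, myData acc) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            acc += combineData[ids[i]];                    // reduce one contiguous block of iterations
        return acc;
    },
    [](myData a, const myData& b) { a += b; return a; }); // join adjacent blocks, left before right

If I remember the semantics correctly, the join always combines a left block with the block immediately to its right, so associativity is enough and commutativity is not required.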
If the order of the operations doesn't matter, then your snippet is almost complete. What it lacks is a critical construct to aggregate the private data:
std::vector<int> ids;              // mappings
std::map<int, myData> combineData; // data per id
myData outputData;                 // combined data based on the mappings

#pragma omp parallel
{
    myData threadData; // data per thread
    #pragma omp for nowait
    for (int ii = 0; ii < total_iterations; ii++)
    {
        threadData += combineData[ids[ii]];
    }
    #pragma omp critical
    {
        outputData += threadData;
    }
    #pragma omp barrier
    // From here on every thread is guaranteed to see
    // the correct value of outputData
}
The schedule of the for loop in this case is not important for the semantics. If the overloaded operator+= takes a relatively stable amount of time per call, then you can use schedule(static), which divides the iterations evenly among threads. Otherwise you can resort to another schedule to balance the computational burden (e.g. schedule(guided)).
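For instance, with the loop from the first snippet, only the clause changes:

#pragma omp for schedule(guided) nowait
for (int ii = 0; ii < total_iterations; ii++)
{
    threadData += combineData[ids[ii]];
}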
Finally, if myData is a typedef of an intrinsic type, then you can avoid the critical section and use a reduction clause:
// still inside the omp parallel region opened above
#pragma omp for reduction(+:outputData)
for (int ii = 0; ii < total_iterations; ii++)
{
    outputData += combineData[ids[ii]];
}
In this case you don't need to declare anything explicitly as private.
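If your compiler supports OpenMP 4.0 or later, you can also declare a user-defined reduction for a non-intrinsic myData; a minimal sketch, assuming myData is default-constructible and its operator+= does not care about the order in which partial results are combined:

#pragma omp declare reduction(+ : myData : omp_out += omp_in) \
    initializer(omp_priv = myData())

#pragma omp parallel for reduction(+ : outputData)
for (int ii = 0; ii < total_iterations; ii++)
{
    outputData += combineData[ids[ii]];
}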
This depends on a few properties of the addition operator of myData. If the operator is both associative, (A + B) + C = A + (B + C), and commutative, A + B = B + A, then you can use a critical section or, if the data is plain old data (e.g. a float, an int, ...), a reduction clause.
However, if it's not commutative, as you say (the order of operations matters), but still associative, you can fill an array of partial results, one element per thread, in parallel and then merge them in order serially (see the code below). Using schedule(static) splits the chunks more or less evenly and in order of increasing thread number, which is what you want.
If the operator is neither associative nor commutative, then I don't think you can parallelize it efficiently (e.g. try parallelizing a Fibonacci series efficiently).
std::vector<int> ids;              // mappings
std::map<int, myData> combineData; // data per id
myData outputData;                 // combined data based on the mappings
myData *threadData;                // one partial result per thread
int nthreads;

#pragma omp parallel
{
    #pragma omp single
    {
        nthreads = omp_get_num_threads();
        threadData = new myData[nthreads];
    }
    myData tmp;
    #pragma omp for schedule(static)
    for (int i = 0; i < 30000; i++) {
        tmp += combineData[ids[i]];
    }
    threadData[omp_get_thread_num()] = tmp;
}
// merge the per-thread results in order of increasing thread number
for (int i = 0; i < nthreads; i++) {
    outputData += threadData[i];
}
delete[] threadData;
Edit: I'm not 100% sure at this point if the chunks will be assigned in order of increasing thread number with #pragma omp for schedule(static) (though I would be surprised if they are not). There is an ongoing discussion on this issue. Meanwhile, if you want to be 100% sure, then instead of
#pragma omp for schedule(static)
for (int i = 0; i < 30000; i++) {
    tmp += combineData[ids[i]];
}
you can do
const int nthreads = omp_get_num_threads();
const int ithread  = omp_get_thread_num();
const int start  = ithread * 30000 / nthreads;
const int finish = (ithread + 1) * 30000 / nthreads;
for (int i = start; i < finish; i++) {
    tmp += combineData[ids[i]];
}
Edit: I found a more elegant way to fill in parallel but merge in order:
#pragma omp parallel
{
    myData tmp;
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < 30000; i++) {
        tmp += combineData[ids[i]];
    }
    #pragma omp for schedule(static) ordered
    for (int i = 0; i < omp_get_num_threads(); i++) {
        #pragma omp ordered
        outputData += tmp;
    }
}
This avoids allocating data for each thread (threadData) and merging outside of the parallel region.