C++11 includes very cool new features, but I can't find many examples of how to parallelize a for-loop. So my very naive question is: how do you parallelize a simple for loop?
Using this class you can do it as:
Range-based loop (read and write):

pforeach(auto &val, container) {
    val = sin(val);
};
Index-based for-loop:

auto new_container = container;
pfor(size_t i, 0, container.size()) {
    new_container[i] = sin(container[i]);
};
AFAIK the simplest way to parallelize a loop, if you are sure that no concurrent accesses are possible, is by using OpenMP.
It is supported by all major compilers except LLVM (as of August 2013).
Example:

for(int i = 0; i < n; ++i)
{
    tab[i] *= 2;
    tab2[i] /= 2;
    tab3[i] += tab[i] - tab2[i];
}
This would be parallelized very easily like this:

#pragma omp parallel for
for(int i = 0; i < n; ++i)
{
    tab[i] *= 2;
    tab2[i] /= 2;
    tab3[i] += tab[i] - tab2[i];
}
However, be aware that this is only efficient when the loop has a large number of iterations, since spawning and synchronizing threads has a fixed overhead.
If you use g++, another very C++11-ish way of doing it is to use a lambda with a for_each, relying on the GNU parallel extensions (which can use OpenMP behind the scenes):
__gnu_parallel::for_each(std::begin(tab), std::end(tab), [&](int &value)
{
    stuff_of_your_loop(value);  // for_each passes each element to the lambda
});
However, for_each is mainly intended for containers such as arrays and vectors. If you only want to iterate over a range of indices, you can "cheat" by creating a Range class whose begin and end methods return iterators that mostly just increment an int.
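A minimal sketch of such a Range class (the names Range and its nested iterator are hypothetical, not from any library):

```cpp
#include <iterator>

// Iterating a Range yields the ints in [first, last).
class Range {
public:
    class iterator {
        int value_;
    public:
        // Minimal iterator surface; enough for for_each-style loops.
        typedef std::forward_iterator_tag iterator_category;
        typedef int value_type;
        typedef std::ptrdiff_t difference_type;
        typedef const int* pointer;
        typedef const int& reference;

        explicit iterator(int value) : value_(value) {}
        int operator*() const { return value_; }
        iterator& operator++() { ++value_; return *this; }  // "mostly increment an int"
        bool operator==(const iterator& other) const { return value_ == other.value_; }
        bool operator!=(const iterator& other) const { return value_ != other.value_; }
    };

    Range(int first, int last) : first_(first), last_(last) {}
    iterator begin() const { return iterator(first_); }
    iterator end() const { return iterator(last_); }

private:
    int first_, last_;
};
```

You could then call __gnu_parallel::for_each(Range(0, n).begin(), Range(0, n).end(), [&](int i) { ... }); note that parallel mode may fall back to a sequential run for iterators that are not random access.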
Note that for simple loops that do mathematical work, many of the algorithms in <numeric> and <algorithm> can be parallelized with G++'s parallel mode.
Well, obviously it depends on what your loop does, how you choose to parallelize it, and how you manage the threads' lifetimes.
I'm reading the book on the C++11 standard threading library (by an author who is also one of the boost.thread maintainers and wrote Just::Thread), and I can see that "it depends".
Now, to give you an idea of the basics using the new standard threading, I would recommend reading the book, as it gives plenty of examples. Also, take a look at http://www.justsoftwaresolutions.co.uk/threading/ and https://stackoverflow.com/questions/415994/boost-thread-tutorials
Define a macro using std::thread and a lambda expression:

#ifndef PARALLEL_FOR
#define PARALLEL_FOR(INT_LOOP_BEGIN_INCLUSIVE, INT_LOOP_END_EXCLUSIVE, I, O)     \
{                                                                                \
    int LOOP_LIMIT = INT_LOOP_END_EXCLUSIVE - INT_LOOP_BEGIN_INCLUSIVE;          \
    /* a variable-length array of threads is not standard C++, use a vector */   \
    std::vector<std::thread> threads(LOOP_LIMIT);                                \
    auto fParallelLoop = [&](int I){ O; };                                       \
    for(int i = 0; i < LOOP_LIMIT; i++)                                          \
    {                                                                            \
        threads[i] = std::thread(fParallelLoop, i + INT_LOOP_BEGIN_INCLUSIVE);   \
    }                                                                            \
    for(int i = 0; i < LOOP_LIMIT; i++)                                          \
    {                                                                            \
        threads[i].join();                                                       \
    }                                                                            \
}
#endif
Usage:

int aaa = 0; // better: std::atomic<int> aaa;
PARALLEL_FOR(0, 90, i,
{
    aaa += i;
});

It's ugly, but it works (the multi-threading part, that is: the non-atomic increment is a data race, hence the std::atomic<int> hint in the comment).
Can't provide a C++11 specific answer since we're still mostly using pthreads. But, as a language-agnostic answer, you parallelise something by setting it up to run in a separate function (the thread function).
In other words, you have a function like:
def processArraySegment (threadData):
    arrayAddr = threadData->arrayAddr
    startIdx = threadData->startIdx
    endIdx = threadData->endIdx
    for i = startIdx to endIdx:
        doSomethingWith (arrayAddr[i])
    exitThread()
and, in your main code, you can process the array in two chunks:
int xyzzy[100]
threadData->arrayAddr = xyzzy
threadData->startIdx = 0
threadData->endIdx = 49
threadData->done = false
tid1 = startThread (processArraySegment, threadData)
// caveat coder: see below.
threadData->arrayAddr = xyzzy
threadData->startIdx = 50
threadData->endIdx = 99
threadData->done = false
tid2 = startThread (processArraySegment, threadData)
waitForThreadExit (tid1)
waitForThreadExit (tid2)
(keeping in mind the caveat that you should ensure thread 1 has loaded the data into its local storage before the main thread starts modifying it for thread 2, possibly with a mutex or by using an array of structures, one per thread).
In other words, it's rarely a simple matter of just modifying a for loop so that it runs in parallel (though that would be nice), something like:
for {threads=10} ({i} = 0; {i} < ARR_SZ; {i}++)
    array[{i}] = array[{i}] + 1;
Instead, it requires a bit of rearranging your code to take advantage of threads.
And, of course, you have to ensure that it makes sense for the data to be processed in parallel. If you're setting each array element to the previous one plus 1, no amount of parallel processing will help, simply because you have to wait for the previous element to be modified first.
This particular example above simply uses an argument passed to the thread function to specify which part of the array it should process. The thread function itself contains the loop to do the work.
std::thread is not necessarily meant to parallelize loops. It is meant to be the low-level abstraction on which to build constructs like a parallel_for algorithm. If you want to parallelize your loops, you should either write a parallel_for algorithm yourself or use existing libraries which offer task-based parallelism.
The following example shows how you could parallelize a simple loop, but on the other hand it also shows the disadvantages, like the missing load balancing and the complexity of the code for a simple loop.
#include <iostream>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

typedef std::vector<int> container;
typedef container::iterator iter;

container v(100, 1);

auto worker = [] (iter begin, iter end) {
    for(auto it = begin; it != end; ++it) {
        *it *= 2;
    }
};

// serial
worker(std::begin(v), std::end(v));
std::cout << std::accumulate(std::begin(v), std::end(v), 0) << std::endl; // 200

// parallel
std::vector<std::thread> threads(8);
const int grainsize = v.size() / 8;

auto work_iter = std::begin(v);
for(auto it = std::begin(threads); it != std::end(threads) - 1; ++it) {
    *it = std::thread(worker, work_iter, work_iter + grainsize);
    work_iter += grainsize;
}
threads.back() = std::thread(worker, work_iter, std::end(v));

for(auto&& i : threads) {
    i.join();
}
std::cout << std::accumulate(std::begin(v), std::end(v), 0) << std::endl; // 400
Using a library which offers a parallel_for template, it can be simplified to:
parallel_for(std::begin(v), std::end(v), worker);