I am experimenting with OpenMP. I wrote some code to check its performance. On a 4-core single Intel CPU with Kubuntu 11.04, the following program compiled with OpenMP is around
The problem is that the variable k is considered to be a shared variable, so it has to be synced between the threads. A possible solution to avoid this is:
#include <math.h>
#include <iostream>
using namespace std;
int main ()
{
long double i=0;
#pragma omp parallel for reduction(+:i)
for(int t=1; t<30000000; t++){
long double k=0.7;
for(int n=1; n<16; n++){
i=i+pow(k,n);
}
}
cout << i<<"\t";
return 0;
}
Following the hint of Martin Beckett in the comment below, instead of declaring k inside the loop, you can also declare k const and outside the loop.
Otherwise, ejd is correct - the problem here does not seem bad parallelization, but bad optimization when the code is parallelized. Remember that the OpenMP implementation of gcc is pretty young and far from optimal.
Fastest code:
for (int i = 0; i < 100000000; i ++) {;}
Slightly slower code:
#pragma omp parallel for num_threads(1)
for (int i = 0; i < 100000000; i ++) {;}
2-3 times slower code:
#pragma omp parallel for
for (int i = 0; i < 100000000; i ++) {;}
no matter what it is in between { and }. A simple ; or a more complex computation, same results. I compiled under Ubuntu 13.10 64-bit, using both gcc and g++, trying different parameters -ansi -pedantic-errors -Wall -Wextra -O3, and running on an Intel quad-core 3.5GHz.
I guess thread management overhead is at fault? It doens't seem smart for OMP to create a thread everytime you need one and destroy it after. I thought there would be four (or eight) threads being either running whenever needed or sleeping.
I am observing similar behavior on GCC. However I am wondering if in my case it is somehow related with template or inline function. Is your code also within template or inline function? Please look here.
However for very short for loops, you may observe some small overhead related with thread switching like in your case:
#pragma omp parallel for
for (int i = 0; i < 100000000; i ++) {;}
If your loop executes for some seriously long time as few ms or even seconds, you should observe performance boost when using OpenMP. But only when you have more than one CPU. The more cores you have, the higher performance you reach with OpenMP.