I am experimenting with OpenMP. I wrote some code to check its performance. On a 4-core single Intel CPU with Kubuntu 11.04, the following program compiled with OpenMP is around
I am observing similar behavior with GCC. However, I wonder whether in my case it is somehow related to templates or inline functions. Is your code also inside a template or an inline function? Please look here.
However, for very short for loops you may observe some small overhead from thread startup and switching, as in your case:
#pragma omp parallel for
for (int i = 0; i < 100000000; i ++) {;}
If your loop runs for a reasonably long time, say a few milliseconds or even seconds, you should see a performance boost from OpenMP, but only if you have more than one CPU core. In general, the more cores you have, the higher the performance you can reach with OpenMP, provided the loop body does real work that can be split across threads.