I believe everyone agrees with the title of this post. Can someone point me to the reason? Is there any reference for it, such as a book? I have tried to find one, but with no luck.
As Mystical explained, it's likely due to the OpenMP overhead. I have tried to get around this with, for example:
#pragma omp parallel for if(nthreads>1)
I thought this would incur the OpenMP overhead only if nthreads>1. However, at least in Visual Studio 2012, it still has significant overhead. Therefore, to properly compare the single-threaded and multi-threaded versions of a function, I define two versions of the function, one with and one without the OpenMP pragmas.
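The two-version approach can be sketched roughly as follows. This is only an illustration, assuming a C99 compiler: the function names (`sum_squares_serial`, `sum_squares_parallel`, `sum_squares`) and the sum-of-squares loop are made up for the example and are not from my actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Serial version: no OpenMP pragma, so the compiler sees a plain loop
   that it can optimize (e.g. vectorize) as usual. */
double sum_squares_serial(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * x[i];
    return sum;
}

/* Parallel version: same loop body, but with the OpenMP pragma.
   A signed loop index is used because OpenMP 2.0 (the version
   Visual Studio supports) requires a signed integer loop variable. */
double sum_squares_parallel(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i)
        sum += x[i] * x[i];
    return sum;
}

/* Hypothetical dispatcher: choose the version in ordinary code instead
   of relying on the if() clause of the pragma. */
double sum_squares(const double *x, size_t n, int nthreads)
{
    return (nthreads > 1) ? sum_squares_parallel(x, n)
                          : sum_squares_serial(x, n);
}
```

Timing `sum_squares_serial` against `sum_squares_parallel` with one thread then shows the cost of the OpenMP version itself, separately from any speedup gained with more threads.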
While there is some runtime overhead from using OpenMP even with only one thread, the more important issue is likely to be that the code transformations the compiler has to perform to generate OpenMP code (in particular, outlining the parallel region into separate functions [done by gcc and icc; PGI does something different...]) can interfere with other optimizations, such as vectorization. Information available to the compiler within a single function, which enables those optimizations, can get lost when parts of the code are moved into outlined functions, so the generated code may be worse.
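To make the outlining idea concrete, here is a hand-written sketch of roughly what such a transformation looks like. It is not the actual output of any compiler: `scale_shared`, `scale_outlined`, and `fake_runtime_fork` are hypothetical names, and the stand-in "runtime" simply calls the outlined function once on the current thread.

```c
#include <assert.h>
#include <stddef.h>

/* Variables shared with the parallel region get packed into a struct. */
struct scale_shared {
    double *y;
    const double *x;
    double a;
    long n;
};

/* Hypothetical outlined body: the loop that used to live inside scale()
   is now a separate function that only receives an opaque pointer.
   The optimizer no longer sees directly where y, x, and n came from. */
static void scale_outlined(void *arg)
{
    assert(arg != NULL);
    struct scale_shared *s = (struct scale_shared *)arg;
    for (long i = 0; i < s->n; ++i)
        s->y[i] = s->a * s->x[i];
}

/* Hypothetical stand-in for the runtime's fork call. A real runtime
   would run fn on every thread of the team and split the iterations;
   this one just calls it once. */
static void fake_runtime_fork(void (*fn)(void *), void *arg)
{
    fn(arg);
}

/* What the user wrote as a simple loop under "#pragma omp parallel for"
   becomes: pack the shared variables, then call through a pointer. */
void scale(double *y, const double *x, double a, long n)
{
    struct scale_shared s = { y, x, a, n };
    fake_runtime_fork(scale_outlined, &s);
}
```

Inside `scale_outlined`, every access goes through `s`, so facts that were obvious in the original function (for example, where the pointers came from or what the loop bound is) have to be rediscovered by the optimizer, and that is where optimizations such as vectorization can be lost.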