I believe everyone agree with the title of this post. Can someone point me the reason ? Any reference to that like book etc ? I have tried to find but no luck.
I believe
While there is some overhead at runtime from using OpenMP even with only one thread, the more important issue is likely to be that the code transformations that the compiler has to perform to generate OpenMP code (in particular outlining the parallel region code into separate functions [done by gcc and icc; PGI do something different...]) will be affecting other code optimizations (like vectorization). Information that the compiler has in a single function that allows optimizations potentially gets lost when parts of the code are executed in the outlined functions, so the generated code may be worse.