Why does my code run slower with multiple threads than with a single thread when it is compiled for profiling (-pg)?

我的梦境 提交于 2019-12-03 21:49:40

What do you get without the -pg flag? That's not debugging symbols (which don't affect the code generation), that's for profiling (which does).

It's quite plausible that profiling in a multithreaded process requires additional locking which slows the multithreaded version down, even to the point of making it slower than the non-multithreaded version.

You are talking about two different things here. Debug symbols and compiler optimization. If you use the strongest optimization settings the compiler has to offer, you do so at the consequence of losing symbols that are useful in debugging.

Your application is not running slower due to debugging symbols, its running slower because of less optimization done by the compiler.

Debugging symbols are not 'overhead' beyond the fact that they occupy more disk space. Code compiled at maximum optimization (-O3) should not be adding debug symbols. That's a flag that you would set when you have no need for said symbols.

If you need debugging symbols, you gain them at the expense of losing compiler optimization. However, once again, this is not 'overhead', its just the absence of compiler optimization.

Is the profile code inserting instrumentation calls in enough functions to hurt you?
If you single-step at the assembly language level, you'll find out pretty quick.

Multithreaded code execution time is not always measured as expected by gprof. You should time your code with an other timer in addition to gprof to see the difference.

My example: Running LULESH CORAL benchmark on a 2NUMA nodes INTEL sandy bridge (8 cores + 8 cores) with size -s 50 and 20 iterations -i, compile with gcc 6.3.0, -O3, I have:

With 1 thread running: ~3,7 without -pg and ~3,8 with it, but according to gprof analysis the code has ran only for 3,5.

WIth 16 threads running: ~0,6 without -pg and ~0,8 with it, but according to gprof analysis the code has ran for ~4,5 ...

The time in bold has been measured gettimeofday, outside the parallel region (start and end of main function).

Therefore, maybe if you would have measure your application time the same way, you would have seen the same speeduo with and without -pg. It is just the gprof measure which is wrong in parallel. In LULESH openmp version either way.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!