Question
I'm currently implementing a TensorFlow custom op (for a custom data fetcher) in C++ in order to speed up my TensorFlow model. Since my model doesn't use the GPU much, I believe I can achieve maximal performance by running multiple worker threads concurrently.
The problem is that even though I have enough workers, my program doesn't utilize all of the CPU. On my development machine (4 physical cores), it uses about 90% user time and 4% sys time with 4 worker threads and the tf.ConfigProto(inter_op_parallelism_threads=6) option.
With more worker threads and larger inter_op_parallelism_threads values, I get much worse model performance than with the previous configuration. Since I'm not good at profiling, I don't know where the bottleneck in my code is.
Are there any rules of thumb for maximizing CPU usage, and/or good tools for finding performance bottlenecks or mutex lock contention in a single process (not system-wide) on Linux?
EDIT: My code is driven from Python, but (almost) all of the execution happens in C++ code. Some of that code is not mine (TensorFlow and Eigen). I've built a shared library that can be dynamically loaded from Python, and it is called by a TensorFlow kernel. TensorFlow owns its thread pool, my dynamic library also owns a thread pool, and my code is thread-safe. I also create threads that call sess.run() concurrently. Just as Python can issue multiple HTTP requests concurrently, sess.run() releases the GIL. My objective is to call sess.run() as many times as possible to increase "real" performance, and none of the Python-related profilers I tried were successful.
Answer 1:
1) More threads do not mean more speed. If you have 4 cores, you cannot go any faster than 4 times the speed of 1 core.
2) What you should do is tune your code for maximum performance in single-thread execution (with compiler optimization turned off), and after you have done that, turn on the compiler's optimizer and make the code multi-threaded, with no more threads than you have cores.
P.S. It is a common misconception that performance tuning can only be done on compiler-optimized code. This explains why it's not so.
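As a rough illustration of the "no more threads than you have cores" advice, here is a minimal Python sketch. The helper names pick_worker_count and run_concurrently are hypothetical, and fn stands in for any call that releases the GIL (such as the sess.run() calls described in the question); this is not the asker's actual code.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pick_worker_count(requested, cores=None):
    """Cap the requested worker count at the number of usable cores.

    Hypothetical helper for illustration only.
    """
    if cores is None:
        if hasattr(os, "sched_getaffinity"):
            # On Linux this respects CPU masks set by taskset or cgroups.
            cores = len(os.sched_getaffinity(0))
        else:
            cores = os.cpu_count() or 1
    # Never go below one worker, never above the core count.
    return max(1, min(requested, cores))


def run_concurrently(fn, items, requested_workers):
    """Run fn over items on a thread pool sized by pick_worker_count.

    fn is assumed to release the GIL while it works; otherwise the
    threads will not run in parallel regardless of the pool size.
    """
    workers = pick_worker_count(requested_workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the inputs.
        return list(pool.map(fn, items))
```

Note that this counts logical CPUs, so on a machine with hyper-threading the cap may be higher than the number of physical cores mentioned in the question. If TensorFlow and the custom library each keep their own thread pool, the total number of busy threads across all pools is what matters, not the size of any single pool.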
Source: https://stackoverflow.com/questions/40218075/multithreading-how-to-use-cpu-as-much-as-possible