400 threads in 20 processes outperform 400 threads in 4 processes while performing a CPU-bound task on 4 CPUs

冷暖自知 submitted on 2019-12-06 09:29:35
Martin

Before a Python thread can execute code it needs to acquire the Global Interpreter Lock (GIL). This is a per-process lock. In some cases (e.g. when waiting for I/O operations to complete) a thread will routinely release the GIL so other threads can acquire it. If the active thread does not give up the lock within a certain interval, other threads can signal the active thread to release the GIL so they can take turns.
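That "certain interval" is exposed by CPython as the switch interval. A minimal sketch showing how to read and tune it (the specific values below are just for illustration):

```python
import sys

# CPython asks the thread holding the GIL to release it roughly every
# "switch interval" so waiting threads get a turn. The default is 5 ms.
print(sys.getswitchinterval())  # default: 0.005 seconds

# The interval is tunable; a longer interval means fewer, longer turns
# (less switching overhead, worse latency for waiting threads).
sys.setswitchinterval(0.01)
print(sys.getswitchinterval())
sys.setswitchinterval(0.005)  # restore the default
```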

With that in mind let's look at how your code performs on my 4 core laptop:

  1. In the simplest case (1 process with 1 thread) I get ~155 tasks/s. The GIL is not getting in our way here. We use 100% of one core.

  2. If I bump up the number of threads (1 process with 4 threads), I get ~70 tasks/s. This might be counter-intuitive at first but can be explained by the fact that your code is mostly CPU-bound, so all threads need the GIL pretty much all the time. Only one of them can run its computation at a time, so we don't benefit from multithreading. The result is that we use ~25% of each of my 4 cores. To make matters worse, acquiring and releasing the GIL as well as context switching add significant overhead that brings down overall performance.

  3. Adding more threads (1 process with 400 threads) doesn't help since only one of them gets executed at a time. On my laptop performance is pretty similar to case (2), again we use ~25% of each of my 4 cores.

  4. With 4 processes with 1 thread each, I get ~550 tasks/s. Almost 4 times what I got in case (1). Actually, a little bit less due to overhead required for inter-process communication and locking on the shared queue. Note that each process is using its own GIL.

  5. With 4 processes running 100 threads each, I get ~290 tasks/s. Again we see the slow-down from (2), this time affecting each separate process.

  6. With 400 processes running 1 thread each, I get ~530 tasks/s. Compared to (4) we see additional overhead due to inter-process communication and locking on the shared queue.
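The pattern in cases (2) and (4) is easy to reproduce with a small harness. This is a sketch, not your benchmark: it uses a hypothetical CPU-bound task and concurrent.futures rather than your queue-based setup, but on a multi-core machine the process pool typically finishes well ahead of the thread pool:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def task(n=50_000):
    # Purely CPU-bound work: no I/O, so the GIL is never released voluntarily.
    total = 0
    for i in range(n):
        total += i * i
    return total

def benchmark(executor_cls, workers, jobs=16):
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        list(pool.map(task, [50_000] * jobs))
    return time.perf_counter() - start

if __name__ == "__main__":
    # 4 threads share one GIL; 4 processes each get their own.
    print(f"4 threads:   {benchmark(ThreadPoolExecutor, 4):.2f}s")
    print(f"4 processes: {benchmark(ProcessPoolExecutor, 4):.2f}s")
```

The `if __name__ == "__main__"` guard matters: ProcessPoolExecutor re-imports the module in each worker, so unguarded top-level code would run again in every child process.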

Please refer to David Beazley's talk Understanding the Python GIL for a more in-depth explanation of these effects.

Note: Some Python interpreters like CPython and PyPy have a GIL while others like Jython and IronPython don't. If you use another Python interpreter you might see very different behavior.
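If you are unsure which interpreter your code is running on, the standard library can tell you:

```python
import platform

# Returns the interpreter name, e.g. "CPython", "PyPy", "Jython", "IronPython".
impl = platform.python_implementation()
print(impl)
```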

Threads in Python do not execute in parallel because of the infamous global interpreter lock:

In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once.

This is why one thread per process performs best in your benchmarks.

Avoid using threading.Thread if truly parallel execution is important.
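A minimal sketch of the alternative: multiprocessing gives each worker its own interpreter and its own GIL, so CPU-bound work runs truly in parallel. The `square` function here is a hypothetical stand-in for your task:

```python
from multiprocessing import Pool

def square(x):
    # Stand-in for a CPU-bound task; must be defined at module level
    # so it can be pickled and sent to the worker processes.
    return x * x

if __name__ == "__main__":
    # 4 separate processes, each with its own GIL.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```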
