400 threads in 20 processes outperform 400 threads in 4 processes while performing a CPU-bound task on 4 CPUs

冷暖自知 submitted on 2019-12-06 09:29:35
Martin

Before a Python thread can execute code it needs to acquire the Global Interpreter Lock (GIL). This is a per-process lock. In some cases (e.g. when waiting for I/O operations to complete) a thread will routinely release the GIL so other threads can acquire it. If the active thread does not give up the lock within a certain interval, other threads can signal the active thread to release the GIL so they can take turns.
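That "certain interval" is exposed by CPython as the switch interval. A minimal sketch showing how to read and tune it (the specific values below are just for illustration):

```python
import sys

# CPython asks the thread holding the GIL to release it roughly every
# "switch interval" so waiting threads get a turn. The default is 5 ms.
print(sys.getswitchinterval())  # default: 0.005 seconds

# The interval is tunable; a longer interval means fewer, longer turns
# (less switching overhead, worse latency for waiting threads).
sys.setswitchinterval(0.01)
print(sys.getswitchinterval())
sys.setswitchinterval(0.005)  # restore the default
```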

With that in mind let's look at how your code performs on my 4 core laptop:

  1. In the simplest case (1 process with 1 thread) I get ~155 tasks/s. The GIL is not getting in our way here. We use 100% of one core.

  2. If I bump up the number of threads (1 process with 4 threads), I get ~70 tasks/s. This might be counter-intuitive at first but can be explained by the fact that your code is mostly CPU-bound, so all threads need the GIL pretty much all the time. Only one of them can run its computation at a time, so we don't benefit from multithreading. The result is that we use ~25% of each of my 4 cores. To make matters worse, acquiring and releasing the GIL as well as context switching add significant overhead that brings down overall performance.

  3. Adding more threads (1 process with 400 threads) doesn't help since only one of them gets executed at a time. On my laptop performance is pretty similar to case (2), again we use ~25% of each of my 4 cores.

  4. With 4 processes with 1 thread each, I get ~550 tasks/s. Almost 4 times what I got in case (1). Actually, a little bit less due to overhead required for inter-process communication and locking on the shared queue. Note that each process is using its own GIL.

  5. With 4 processes running 100 threads each, I get ~290 tasks/s. Again we see the slow-down from (2), this time affecting each separate process.

  6. With 400 processes running 1 thread each, I get ~530 tasks/s. Compared to (4) we see additional overhead due to inter-process communication and locking on the shared queue.
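The pattern in cases (2) and (4) is easy to reproduce with a small harness. This is a sketch, not your benchmark: it uses a hypothetical CPU-bound task and concurrent.futures rather than your queue-based setup, but on a multi-core machine the process pool typically finishes well ahead of the thread pool:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def task(n=50_000):
    # Purely CPU-bound work: no I/O, so the GIL is never released voluntarily.
    total = 0
    for i in range(n):
        total += i * i
    return total

def benchmark(executor_cls, workers, jobs=16):
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        list(pool.map(task, [50_000] * jobs))
    return time.perf_counter() - start

if __name__ == "__main__":
    # 4 threads share one GIL; 4 processes each get their own.
    print(f"4 threads:   {benchmark(ThreadPoolExecutor, 4):.2f}s")
    print(f"4 processes: {benchmark(ProcessPoolExecutor, 4):.2f}s")
```

The `if __name__ == "__main__"` guard matters: ProcessPoolExecutor re-imports the module in each worker, so unguarded top-level code would run again in every child process.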

Please refer to David Beazley's talk Understanding the Python GIL for a more in-depth explanation of these effects.

Note: Some Python interpreters like CPython and PyPy have a GIL while others like Jython and IronPython don't. If you use another Python interpreter you might see very different behavior.
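If you are unsure which interpreter your code is running on, the standard library can tell you:

```python
import platform

# Returns the interpreter name, e.g. "CPython", "PyPy", "Jython", "IronPython".
impl = platform.python_implementation()
print(impl)
```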

Threads in Python do not execute in parallel because of the infamous global interpreter lock:

In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once.

This is why one thread per process performs best in your benchmarks.

Avoid using threading.Thread if truly parallel execution is important.
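A minimal sketch of the alternative: multiprocessing gives each worker its own interpreter and its own GIL, so CPU-bound work runs truly in parallel. The `square` function here is a hypothetical stand-in for your task:

```python
from multiprocessing import Pool

def square(x):
    # Stand-in for a CPU-bound task; must be defined at module level
    # so it can be pickled and sent to the worker processes.
    return x * x

if __name__ == "__main__":
    # 4 separate processes, each with its own GIL.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```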
