Numba `nogil=True` + ThreadPoolExecutor results in smaller speed up than expected

Submitted by 梦想的初衷 on 2019-12-13 02:49:16

Question


This is a follow-up to my previous question:

I'm trying to use Numba and Dask to speed up a slow computation that is similar to calculating the kernel density estimate of a huge collection of points. My plan was to put the computationally expensive logic in a jitted function and then split the work among the CPU cores using Dask. I wanted to use the nogil feature of numba.jit so that I could use the Dask threading backend and avoid unnecessary memory copies of the input data (which is very large).
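The reason the threading backend avoids those copies is that threads share the parent process's address space, whereas process workers receive a pickled copy. A stdlib-only sketch (no Dask or Numba needed; `worker` and `big` are hypothetical stand-ins for the jitted kernel and the large input array):

```python
from concurrent.futures import ThreadPoolExecutor

big = bytearray(10_000_000)  # stand-in for the large input array

def worker(buf):
    # threads share the parent's memory, so this is the very same object,
    # not a pickled copy as it would be with ProcessPoolExecutor
    return buf is big

with ThreadPoolExecutor(max_workers=4) as exc:
    print(all(exc.submit(worker, big).result() for _ in range(4)))  # True
```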

It turns out that part of the problem has to do with unpacking arguments inside a jitted function. However, even with that fix I see little to no speedup from the following code:
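The fix is to unpack each argument tuple at submission time, with `exc.submit(func, *arg)`, so the worker receives plain positional arguments and the jitted function never has to unpack a tuple itself. A minimal stdlib sketch of that pattern (`scale` is a hypothetical stand-in for the jitted kernel):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scale(data, mag):
    # stand-in for the jitted kernel: receives already-unpacked arguments
    return [x * mag for x in data]

args = [([1, 2, 3], 10), ([4, 5], 100)]

with ThreadPoolExecutor(max_workers=2) as exc:
    # the * unpacks each (data, mag) tuple into positional arguments
    futs = {exc.submit(scale, *arg): i for i, arg in enumerate(args)}
    results = {futs[f]: f.result() for f in as_completed(futs)}

print(results[0])  # [10, 20, 30]
```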

import os
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
from numba import njit, jit

CPU_COUNT = os.cpu_count()
print("CPU_COUNT", CPU_COUNT)

def PE(pool, func, args):
    with pool(max_workers=CPU_COUNT) as exc:
        fut = {exc.submit(func, *arg): i for i, arg in enumerate(args)}
        for f in as_completed(fut):
            f.result()

def render(params, mag):
    """Render gaussian peaks in small windows"""

    radius = 3

    for i in range(len(params)):
        y0 = params[i, 0] * mag
        x0 = params[i, 1] * mag
        sy = params[i, 2] * mag
        sx = params[i, 3] * mag

        # calculate the render window size
        wy = int(sy * radius * 2.0)
        wx = int(sx * radius * 2.0)

        # calculate the area in the image
        ystart = int(np.rint(y0)) - wy // 2
        yend = ystart + wy
        xstart = int(np.rint(x0)) - wx // 2
        xend = xstart + wx

        # adjust coordinates to window coordinates
        y1 = y0 - ystart
        x1 = x0 - xstart

        y = np.arange(wy)
        x = np.arange(wx)
        amp = 1 / (2 * np.pi * sy * sx)
        # evaluate the gaussian at the window coordinates (y1, x1), not the
        # image coordinates (y0, x0), so the peak lands inside the window
        gy = np.exp(-((y - y1) / sy) ** 2 / 2)
        gx = np.exp(-((x - x1) / sx) ** 2 / 2)
        g = amp * np.outer(gy, gx)

jit_render = jit(render, nopython=True, nogil=True)

args = [(np.random.rand(1000000, 4) * (1, 1, 0.02, 0.02), 100) for i in range(CPU_COUNT)]

print("Single time:")
# %timeit render(*args[0])
%timeit jit_render(*args[0])

print()
print("Linear time:")
%time [jit_render(*a) for a in args]

print()
print("Threads time:")
%time PE(ThreadPoolExecutor, jit_render, args)

On my 8-core MacBook I get a speedup of about 2X:

Single time:
1.6 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Linear time:
CPU times: user 11.8 s, sys: 43.1 ms, total: 11.9 s
Wall time: 11.9 s

Threads time:
CPU times: user 45.4 s, sys: 125 ms, total: 45.5 s
Wall time: 6.29 s

On my 24-core Windows box I get a speedup of only about 1.4X:

Single time:
1.91 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Linear time:
Wall time: 1min 30s

Threads time:
Wall time: 1min 4s
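One way to check that the thread-pool pattern itself can scale, independently of Numba, is to drive it with a stdlib function that releases the GIL during its C-level work; zlib.compress does so while compressing, much like a nogil=True jitted function. A minimal sketch (timings will vary by machine, so only the serial/threaded wall times are printed for comparison):

```python
import os
import time
import zlib
from concurrent.futures import ThreadPoolExecutor, as_completed

# zlib.compress releases the GIL while compressing, so threads can
# overlap the C-level work much like a nogil=True jitted function
data = os.urandom(1 << 22)  # 4 MiB of incompressible bytes
n = 4

def work(buf):
    return len(zlib.compress(buf, 6))

t0 = time.perf_counter()
serial = [work(data) for _ in range(n)]
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=n) as exc:
    futs = [exc.submit(work, data) for _ in range(n)]
    parallel = [f.result() for f in as_completed(futs)]
t_parallel = time.perf_counter() - t0

print(f"serial {t_serial:.2f}s, threads {t_parallel:.2f}s")
```

If this sketch scales on a machine but the render benchmark does not, the bottleneck is likely inside the kernel itself (for example, the many small NumPy allocations per window) rather than the GIL.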

Source: https://stackoverflow.com/questions/56926880/numba-nogil-true-threadpoolexecutor-results-in-smaller-speed-up-than-expecte
