Question
This is a follow-up to my previous question:

I'm trying to use Numba and Dask to speed up a slow computation that is similar to calculating the kernel density estimate of a huge collection of points. My plan was to write the computationally expensive logic in a `jit`-ed function and then split the work among the CPU cores using `dask`. I wanted to use the `nogil` feature of `numba.jit` so that I could use the `dask` threading backend and avoid unnecessary memory copies of the input data (which is very large).

It turns out that part of the problem has to do with unpacking arguments in a `jit`-ed function. However, even with that fix I see little to no speed-up of the following code:
```python
import os
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
from numba import njit, jit

CPU_COUNT = os.cpu_count()
print("CPU_COUNT", CPU_COUNT)


def PE(pool, func, args):
    with pool(max_workers=CPU_COUNT) as exc:
        fut = {exc.submit(func, *arg): i for i, arg in enumerate(args)}
        for f in as_completed(fut):
            f.result()


def render(params, mag):
    """Render gaussian peaks in small windows"""
    radius = 3
    for i in range(len(params)):
        y0 = params[i, 0] * mag
        x0 = params[i, 1] * mag
        sy = params[i, 2] * mag
        sx = params[i, 3] * mag
        # calculate the render window size
        wy = int(sy * radius * 2.0)
        wx = int(sx * radius * 2.0)
        # calculate the area in the image
        ystart = int(np.rint(y0)) - wy // 2
        yend = ystart + wy
        xstart = int(np.rint(x0)) - wx // 2
        xend = xstart + wx
        # adjust coordinates to window coordinates
        y1 = y0 - ystart
        x1 = x0 - xstart
        y = np.arange(wy)
        x = np.arange(wx)
        amp = 1 / (2 * np.pi * sy * sx)
        # center the gaussian at the window coordinates (y1, x1)
        # rather than the image coordinates (y0, x0)
        gy = np.exp(-((y - y1) / sy) ** 2 / 2)
        gx = np.exp(-((x - x1) / sx) ** 2 / 2)
        g = amp * np.outer(gy, gx)


jit_render = jit(render, nopython=True, nogil=True)

args = [(np.random.rand(1000000, 4) * (1, 1, 0.02, 0.02), 100) for i in range(CPU_COUNT)]

print("Single time:")
# %timeit render(*args[0])
%timeit jit_render(*args[0])
print()

print("Linear time:")
%time [jit_render(*a) for a in args]
print()

print("Threads time:")
%time PE(ThreadPoolExecutor, jit_render, args)
```
On my 8-core MacBook I get a speed-up of about 2X:

```
Single time:
1.6 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Linear time:
CPU times: user 11.8 s, sys: 43.1 ms, total: 11.9 s
Wall time: 11.9 s

Threads time:
CPU times: user 45.4 s, sys: 125 ms, total: 45.5 s
Wall time: 6.29 s
```
On my 24-core Windows box I get a speed-up of about 1X:

```
Single time:
1.91 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Linear time:
Wall time: 1min 30s

Threads time:
Wall time: 1min 4s
```
Source: https://stackoverflow.com/questions/56926880/numba-nogil-true-threadpoolexecutor-results-in-smaller-speed-up-than-expecte