Numba `nogil=True` + ThreadPoolExecutor results in smaller speed up than expected

Submitted by 梦想的初衷 on 2019-12-13 02:49:16

Question


This is a follow-up to my previous question:

I'm trying to use Numba and Dask to speed up a slow computation that is similar to calculating the kernel density estimate of a huge collection of points. My plan was to put the computationally expensive logic in a jitted function and then split the work among the CPU cores using Dask. I wanted to use the nogil feature of numba.jit so that I could use the Dask threading backend and avoid unnecessary memory copies of the input data (which is very large).
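The reason the threading backend avoids those copies is that threads share the parent process's address space, whereas process workers receive a pickled copy. A stdlib-only sketch (no Dask or Numba needed; `worker` and `big` are hypothetical stand-ins for the jitted kernel and the large input array):

```python
from concurrent.futures import ThreadPoolExecutor

big = bytearray(10_000_000)  # stand-in for the large input array

def worker(buf):
    # threads share the parent's memory, so this is the very same object,
    # not a pickled copy as it would be with ProcessPoolExecutor
    return buf is big

with ThreadPoolExecutor(max_workers=4) as exc:
    print(all(exc.submit(worker, big).result() for _ in range(4)))  # True
```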

It turns out that part of the problem has to do with unpacking arguments inside a jitted function. However, even with that fix I see little to no speedup from the following code:
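The fix is to unpack each argument tuple at submission time, with `exc.submit(func, *arg)`, so the worker receives plain positional arguments and the jitted function never has to unpack a tuple itself. A minimal stdlib sketch of that pattern (`scale` is a hypothetical stand-in for the jitted kernel):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scale(data, mag):
    # stand-in for the jitted kernel: receives already-unpacked arguments
    return [x * mag for x in data]

args = [([1, 2, 3], 10), ([4, 5], 100)]

with ThreadPoolExecutor(max_workers=2) as exc:
    # the * unpacks each (data, mag) tuple into positional arguments
    futs = {exc.submit(scale, *arg): i for i, arg in enumerate(args)}
    results = {futs[f]: f.result() for f in as_completed(futs)}

print(results[0])  # [10, 20, 30]
```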

import os
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
from numba import njit, jit

CPU_COUNT = os.cpu_count()
print("CPU_COUNT", CPU_COUNT)

def PE(pool, func, args):
    with pool(max_workers=CPU_COUNT) as exc:
        fut = {exc.submit(func, *arg): i for i, arg in enumerate(args)}
        for f in as_completed(fut):
            f.result()

def render(params, mag):
    """Render gaussian peaks in small windows"""

    radius = 3

    for i in range(len(params)):
        y0 = params[i, 0] * mag
        x0 = params[i, 1] * mag
        sy = params[i, 2] * mag
        sx = params[i, 3] * mag

        # calculate the render window size
        wy = int(sy * radius * 2.0)
        wx = int(sx * radius * 2.0)

        # calculate the area in the image
        ystart = int(np.rint(y0)) - wy // 2
        yend = ystart + wy
        xstart = int(np.rint(x0)) - wx // 2
        xend = xstart + wx

        # adjust coordinates to window coordinates
        y1 = y0 - ystart
        x1 = x0 - xstart

        y = np.arange(wy)
        x = np.arange(wx)
        amp = 1 / (2 * np.pi * sy * sx)
        # evaluate the gaussian at the window coordinates (y1, x1), not the
        # image coordinates (y0, x0), so the peak lands inside the window
        gy = np.exp(-((y - y1) / sy) ** 2 / 2)
        gx = np.exp(-((x - x1) / sx) ** 2 / 2)
        g = amp * np.outer(gy, gx)

jit_render = jit(render, nopython=True, nogil=True)

args = [(np.random.rand(1000000, 4) * (1, 1, 0.02, 0.02), 100) for i in range(CPU_COUNT)]

print("Single time:")
# %timeit render(*args[0])
%timeit jit_render(*args[0])

print()
print("Linear time:")
%time [jit_render(*a) for a in args]

print()
print("Threads time:")
%time PE(ThreadPoolExecutor, jit_render, args)

On my 8-core MacBook I get a speedup of about 2X:

Single time:
1.6 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Linear time:
CPU times: user 11.8 s, sys: 43.1 ms, total: 11.9 s
Wall time: 11.9 s

Threads time:
CPU times: user 45.4 s, sys: 125 ms, total: 45.5 s
Wall time: 6.29 s

On my 24-core Windows box I get a speedup of only about 1.4X:

Single time:
1.91 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Linear time:
Wall time: 1min 30s

Threads time:
Wall time: 1min 4s
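One way to check that the thread-pool pattern itself can scale, independently of Numba, is to drive it with a stdlib function that releases the GIL during its C-level work; zlib.compress does so while compressing, much like a nogil=True jitted function. A minimal sketch (timings will vary by machine, so only the serial/threaded wall times are printed for comparison):

```python
import os
import time
import zlib
from concurrent.futures import ThreadPoolExecutor, as_completed

# zlib.compress releases the GIL while compressing, so threads can
# overlap the C-level work much like a nogil=True jitted function
data = os.urandom(1 << 22)  # 4 MiB of incompressible bytes
n = 4

def work(buf):
    return len(zlib.compress(buf, 6))

t0 = time.perf_counter()
serial = [work(data) for _ in range(n)]
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=n) as exc:
    futs = [exc.submit(work, data) for _ in range(n)]
    parallel = [f.result() for f in as_completed(futs)]
t_parallel = time.perf_counter() - t0

print(f"serial {t_serial:.2f}s, threads {t_parallel:.2f}s")
```

If this sketch scales on a machine but the render benchmark does not, the bottleneck is likely inside the kernel itself (for example, the many small NumPy allocations per window) rather than the GIL.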

Source: https://stackoverflow.com/questions/56926880/numba-nogil-true-threadpoolexecutor-results-in-smaller-speed-up-than-expecte
