cProfile adds significant overhead when calling numba jit functions

Compare a pure Python no-op function with a no-op function decorated with @numba.njit:

import numba

@numba.njit
def boring_numba():
    pass

def call_numba(x):
    for t in range(x):
        boring_numba()

def boring_normal():
    pass

def call_normal(x):
    for t in range(x):
        boring_normal()

If we time this with %timeit, we get the following:

%timeit call_numba(int(1e7))
792 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit call_normal(int(1e7))
737 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

All perfectly reasonable; there's a small overhead for the numba function, but not much.

If however we use cProfile to profile this code, we get the following:

import cProfile
cProfile.run('call_numba(int(1e7)); call_normal(int(1e7))', sort='cumulative')

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     76/1    0.003    0.000    8.670    8.670 {built-in method builtins.exec}
        1    6.613    6.613    7.127    7.127 experiments.py:10(call_numba)
        1    1.111    1.111    1.543    1.543 experiments.py:17(call_normal)
 10000000    0.432    0.000    0.432    0.000 experiments.py:14(boring_normal)
 10000000    0.428    0.000    0.428    0.000 experiments.py:6(boring_numba)
        1    0.000    0.000    0.086    0.086 dispatcher.py:72(compile)

cProfile thinks there is a massive overhead in calling the numba function. This extends to "real" code: I had a function that simply called my expensive computation (the computation being numba-JIT-compiled), and cProfile reported that the wrapper function was taking around a third of the total time.
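A minimal sketch for reproducing this kind of comparison outside IPython, using only the standard library. The helper names wall_time and wall_time_profiled are made up for illustration; the idea is to time the same call once without a profiler and once under cProfile, and swapping call_normal for call_numba should show the two cases side by side.

```python
import cProfile
import time


def boring_normal():
    pass


def call_normal(x):
    for t in range(x):
        boring_normal()


def wall_time(func, *args):
    """Wall-clock seconds for one call without a profiler (hypothetical helper)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def wall_time_profiled(func, *args):
    """Wall-clock seconds for one call while cProfile is active (hypothetical helper)."""
    profiler = cProfile.Profile()
    start = time.perf_counter()
    profiler.runcall(func, *args)
    return time.perf_counter() - start


plain = wall_time(call_normal, 100_000)
profiled = wall_time_profiled(call_normal, 100_000)
print(f"plain: {plain:.4f} s, under cProfile: {profiled:.4f} s")
```

The ratio between the two numbers is the profiler overhead for that particular function; comparing the ratios for different functions shows how unevenly the overhead is distributed.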

I don't mind cProfile adding a bit of overhead, but if it's massively inconsistent about where it adds that overhead it's not very helpful. Does anyone know why this happens, whether there is anything that can be done about it, and/or if there are any alternative profiling tools that don't interact badly with numba?

When you create a numba function you actually create a numba Dispatcher object. This object redirects a call to boring_numba to the correct (as far as types are concerned) internal jitted function. So even though you wrote a function called boring_numba, that Python function is never called; what is called is a compiled function based on it.

So that boring_numba still shows up during profiling (even though what is actually called is CPUDispatcher.__call__), the Dispatcher object needs to hook into the current thread state and check whether a profiler/tracer is running; if so, it makes it look as if boring_numba were called. This last step is what incurs the overhead, because it has to fake a Python stack frame for boring_numba.
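One way to see the difference between a plain function and the dispatcher is to inspect both objects. This is only a sketch: describe is a hypothetical helper, and the numba part is skipped if numba is not installed. It relies on the dispatcher exposing the original function as py_func and the compiled type signatures as signatures.

```python
try:
    import numba  # assumption: numba may not be installed
except ImportError:
    numba = None


def boring():
    pass


def describe(func):
    """Summarise what kind of object `func` is (hypothetical helper)."""
    return {
        "type": type(func).__name__,
        # Dispatcher objects keep the original Python function as `py_func`.
        "has_py_func": hasattr(func, "py_func"),
        # Dispatcher objects list the compiled type signatures here.
        "signatures": getattr(func, "signatures", None),
    }


print(describe(boring))  # plain Python function

if numba is not None:
    boring_numba = numba.njit(boring)
    boring_numba()  # the first call compiles a version for "no arguments"
    print(describe(boring_numba))
```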

A bit more technical:

When you call the numba function boring_numba it actually calls Dispatcher_Call, which is a wrapper around call_cfunc, and here is the major difference: when a profiler is running, the code dealing with the profiler makes up the majority of the function call. Just compare the if (tstate->use_tracing && tstate->c_profilefunc) branch with the else branch, which runs when no profiler/tracer is active:

static PyObject *
call_cfunc(DispatcherObject *self, PyObject *cfunc, PyObject *args, PyObject *kws, PyObject *locals)
{
    PyCFunctionWithKeywords fn;
    PyThreadState *tstate;
    assert(PyCFunction_Check(cfunc));
    assert(PyCFunction_GET_FLAGS(cfunc) == METH_VARARGS | METH_KEYWORDS);
    fn = (PyCFunctionWithKeywords) PyCFunction_GET_FUNCTION(cfunc);
    tstate = PyThreadState_GET();
    if (tstate->use_tracing && tstate->c_profilefunc)
    {
        /*
         * The following code requires some explaining:
         *
         * We want the jit-compiled function to be visible to the profiler, so we
         * need to synthesize a frame for it.
         * The PyFrame_New() constructor doesn't do anything with the 'locals' value if the 'code's
         * 'CO_NEWLOCALS' flag is set (which is always the case nowadays).
         * So, to get local variables into the frame, we have to manually set the 'f_locals'
         * member, then call `PyFrame_LocalsToFast`, where a subsequent call to the `frame.f_locals`
         * property (by virtue of the `frame_getlocals` function in frameobject.c) will find them.
         */
        PyCodeObject *code = (PyCodeObject*)PyObject_GetAttrString((PyObject*)self, "__code__");
        PyObject *globals = PyDict_New();
        PyObject *builtins = PyEval_GetBuiltins();
        PyFrameObject *frame = NULL;
        PyObject *result = NULL;

        if (!code) {
            PyErr_Format(PyExc_RuntimeError, "No __code__ attribute found.");
            goto error;
        }
        /* Populate builtins, which is required by some JITted functions */
        if (PyDict_SetItemString(globals, "__builtins__", builtins)) {
            goto error;
        }
        frame = PyFrame_New(tstate, code, globals, NULL);
        if (frame == NULL) {
            goto error;
        }
        /* Populate the 'fast locals' in `frame` */
        Py_XDECREF(frame->f_locals);
        frame->f_locals = locals;
        Py_XINCREF(frame->f_locals);
        PyFrame_LocalsToFast(frame, 0);
        tstate->frame = frame;
        C_TRACE(result, fn(PyCFunction_GET_SELF(cfunc), args, kws));
        tstate->frame = frame->f_back;

    error:
        Py_XDECREF(frame);
        Py_XDECREF(globals);
        Py_XDECREF(code);
        return result;
    }
    else
        return fn(PyCFunction_GET_SELF(cfunc), args, kws);
}

I assume that this extra code (in case a profiler is running) slows down the function when you're cProfile-ing.
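To get a feeling for how often profiler-related work is triggered, here is a small standard-library sketch that counts profile events with sys.setprofile. It does not measure numba's frame synthesis; it only illustrates that a profile hook fires for every call and every return, so any extra per-call work done because a profiler is present is multiplied by the number of calls. The helper count_profile_events is a made-up name.

```python
import sys
from collections import Counter


def boring_normal():
    pass


def call_normal(n):
    for _ in range(n):
        boring_normal()


def count_profile_events(func, *args):
    """Run func(*args) with a profile hook installed and count the events it sees."""
    events = Counter()

    def hook(frame, event, arg):
        # event is one of 'call', 'return', 'c_call', 'c_return', 'c_exception'
        events[event] += 1

    sys.setprofile(hook)
    try:
        func(*args)
    finally:
        sys.setprofile(None)  # always remove the hook again
    return events


events = count_profile_events(call_normal, 1000)
print(dict(events))
```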

It's a bit unfortunate that numba functions add so much overhead when you run a profiler, but the slowdown will be almost negligible if you do anything substantial in the numba function. If you also move the for loop into a numba function, even more so.

If you notice that the numba function (with or without a profiler running) takes too much time, then you probably call it too often. In that case, check whether you can move the loop inside the numba function or wrap the code containing the loop in another numba function.
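A minimal sketch of that restructuring, assuming the per-iteration work can itself be jitted. The names step and run_loop are illustrative only, and the fallback decorator exists purely so the sketch still runs where numba is not installed.

```python
try:
    from numba import njit
except ImportError:
    # Assumption: fall back to a do-nothing decorator if numba is missing.
    def njit(func):
        return func


@njit
def step(i):
    # Stand-in for the real per-iteration computation.
    return i * 2


@njit
def run_loop(n):
    # The loop lives inside the jitted function, so the dispatcher
    # (and any profiler hook) is crossed once per run_loop call,
    # not once per iteration.
    total = 0
    for i in range(n):
        total += step(i)
    return total


print(run_loop(10))
```

With this layout a profiler sees a single call to run_loop, so the synthesized-frame cost is paid once instead of once per iteration.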

Note: All of this is (a bit of) speculation; I haven't actually built numba with debug symbols and profiled the C code with a profiler running. However, the number of operations performed when a profiler is running makes this seem very plausible. All of this assumes numba 0.39; I'm not sure whether it applies to past versions as well.

Source: https://stackoverflow.com/questions/51384157/cprofile-adds-significant-overhead-when-calling-numba-jit-functions
