cProfile adds significant overhead when calling numba jit functions

北城以北 submitted on 2019-12-04 00:05:03

When you create a numba function you actually create a numba Dispatcher object. This object redirects a call to boring_numba to the internal jitted function that matches the argument types. So even though you wrote a function called boring_numba, that function itself is never called; what runs is a compiled function based on it.
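To make the idea of type-based dispatch concrete, here is a toy model in pure Python. It is only a sketch and not numba's implementation: the class ToyDispatcher, its overloads dictionary, and the function boring_toy are hypothetical names invented for illustration, and the "compile" step just wraps the original function instead of generating machine code.

```python
class ToyDispatcher:
    """Toy stand-in for a dispatcher object; NOT numba's implementation."""

    def __init__(self, py_func):
        self.py_func = py_func   # the function the user wrote
        self.overloads = {}      # argument-type signature -> specialised callable

    def _specialise(self, signature):
        # Hypothetical "compile" step: a real JIT would emit machine code
        # for this signature; here we only wrap the Python function.
        def specialised(*args):
            return self.py_func(*args)
        specialised.signature = signature
        return specialised

    def __call__(self, *args):
        # Pick (or create) the specialisation matching the argument types.
        signature = tuple(type(a) for a in args)
        if signature not in self.overloads:
            self.overloads[signature] = self._specialise(signature)
        return self.overloads[signature](*args)


def boring(x):
    return x + 1


boring_toy = ToyDispatcher(boring)
boring_toy(1)     # creates the (int,) specialisation
boring_toy(2.5)   # creates the (float,) specialisation
boring_toy(3)     # reuses the (int,) specialisation
```

The name boring_toy refers to the dispatcher object, not to the function boring; every call goes through ToyDispatcher.__call__ first.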

So that boring_numba shows up as called during profiling (even though it isn't; what is called is CPUDispatcher.__call__), the Dispatcher object needs to hook into the current thread state and check whether a profiler/tracer is running. If one is, it makes it look like boring_numba is called. This last step is what incurs the overhead, because it has to fake a Python stack frame for boring_numba.

A bit more technical:

When you call the numba function boring_numba, it actually calls Dispatcher_Call, which is a wrapper around call_cfunc. Here is the major difference: when a profiler is running, the code that deals with the profiler makes up the majority of the function call. Just compare the if (tstate->use_tracing && tstate->c_profilefunc) branch with the else branch that runs when there is no profiler/tracer:

static PyObject *
call_cfunc(DispatcherObject *self, PyObject *cfunc, PyObject *args, PyObject *kws, PyObject *locals)
{
    PyCFunctionWithKeywords fn;
    PyThreadState *tstate;
    assert(PyCFunction_Check(cfunc));
    assert(PyCFunction_GET_FLAGS(cfunc) == METH_VARARGS | METH_KEYWORDS);
    fn = (PyCFunctionWithKeywords) PyCFunction_GET_FUNCTION(cfunc);
    tstate = PyThreadState_GET();
    if (tstate->use_tracing && tstate->c_profilefunc)
    {
        /*
         * The following code requires some explaining:
         *
         * We want the jit-compiled function to be visible to the profiler, so we
         * need to synthesize a frame for it.
         * The PyFrame_New() constructor doesn't do anything with the 'locals' value if the 'code's
         * 'CO_NEWLOCALS' flag is set (which is always the case nowadays).
         * So, to get local variables into the frame, we have to manually set the 'f_locals'
         * member, then call `PyFrame_LocalsToFast`, where a subsequent call to the `frame.f_locals`
         * property (by virtue of the `frame_getlocals` function in frameobject.c) will find them.
         */
        PyCodeObject *code = (PyCodeObject*)PyObject_GetAttrString((PyObject*)self, "__code__");
        PyObject *globals = PyDict_New();
        PyObject *builtins = PyEval_GetBuiltins();
        PyFrameObject *frame = NULL;
        PyObject *result = NULL;

        if (!code) {
            PyErr_Format(PyExc_RuntimeError, "No __code__ attribute found.");
            goto error;
        }
        /* Populate builtins, which is required by some JITted functions */
        if (PyDict_SetItemString(globals, "__builtins__", builtins)) {
            goto error;
        }
        frame = PyFrame_New(tstate, code, globals, NULL);
        if (frame == NULL) {
            goto error;
        }
        /* Populate the 'fast locals' in `frame` */
        Py_XDECREF(frame->f_locals);
        frame->f_locals = locals;
        Py_XINCREF(frame->f_locals);
        PyFrame_LocalsToFast(frame, 0);
        tstate->frame = frame;
        C_TRACE(result, fn(PyCFunction_GET_SELF(cfunc), args, kws));
        tstate->frame = frame->f_back;

    error:
        Py_XDECREF(frame);
        Py_XDECREF(globals);
        Py_XDECREF(code);
        return result;
    }
    else
        return fn(PyCFunction_GET_SELF(cfunc), args, kws);
}
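The shape of that branch can be mimicked in pure Python. The following is only an analogy, not a translation of the C code: call_like_call_cfunc, add_one and _noop_profiler are hypothetical names, the "bookkeeping" is a placeholder dictionary rather than a synthesized frame, and the check uses sys.getprofile(), which only sees profile functions installed through sys.setprofile.

```python
import sys


def call_like_call_cfunc(fn, *args):
    """Pure-Python analogy of the two branches in call_cfunc (illustrative only)."""
    if sys.getprofile() is not None:
        # Slow path: do extra bookkeeping around the real call, loosely
        # standing in for building and tearing down a fake frame.
        record = {"name": fn.__name__, "locals": dict(enumerate(args))}
        result = fn(*args)
        record["result"] = result
        return result, "profiled path"
    # Fast path: just call the function.
    return fn(*args), "fast path"


def add_one(x):
    return x + 1


def _noop_profiler(frame, event, arg):
    # A profile function that ignores every event.
    return None


fast = call_like_call_cfunc(add_one, 1)

sys.setprofile(_noop_profiler)
try:
    slow = call_like_call_cfunc(add_one, 1)
finally:
    sys.setprofile(None)   # always remove the profile function again
```

Both calls return the same value; only the path taken differs, which is the point of the comparison between the two branches.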

I assume that this extra code (in case a profiler is running) slows down the function when you're cProfile-ing.
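One way to check this assumption is to time the same call with and without the profiler enabled. The helper below is a minimal sketch: seconds_per_call and tiny are hypothetical names, and tiny is a plain Python stand-in so the snippet runs without numba. To test the numba case you would pass the jitted function instead, after calling it once so that compilation time is not measured.

```python
import cProfile
import time


def tiny(x):
    # Stand-in for a very cheap function such as boring_numba.
    return x + 1


def seconds_per_call(func, n_calls, profiled):
    """Average wall-clock time per call, with or without cProfile enabled."""
    profiler = cProfile.Profile() if profiled else None
    if profiler is not None:
        profiler.enable()
    try:
        start = time.perf_counter()
        for i in range(n_calls):
            func(i)
        elapsed = time.perf_counter() - start
    finally:
        if profiler is not None:
            profiler.disable()
    return elapsed / n_calls


plain = seconds_per_call(tiny, 100_000, profiled=False)
profiled = seconds_per_call(tiny, 100_000, profiled=True)
print(f"without profiler: {plain * 1e9:.0f} ns/call")
print(f"with cProfile:    {profiled * 1e9:.0f} ns/call")
```

The absolute numbers depend on the machine and the Python version, so only the difference between the two lines is meaningful.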

It's a bit unfortunate that numba functions add so much overhead when you run a profiler, but the slowdown becomes almost negligible if you do anything substantial in the numba function. If you also move the for loop into a numba function, even more so.

If you notice that the numba function (with or without a profiler running) takes too much time, you are probably calling it too often. In that case, check whether you can move the loop inside the numba function or wrap the code containing the loop in another numba function.
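The effect of moving the loop can be seen in the number of calls the profiler records. This sketch uses plain Python functions so it runs without numba; add_one, loop_outside, loop_inside and count_calls are hypothetical names, and in the numba setting add_one and loop_inside would be the jitted functions.

```python
import cProfile
import pstats


def add_one(x):
    return x + 1


def loop_outside(n):
    # The loop lives in the caller: one profiled call to add_one per iteration.
    total = 0
    for i in range(n):
        total += add_one(i)
    return total


def loop_inside(n):
    # The loop lives inside the function: one profiled call in total.
    total = 0
    for i in range(n):
        total += i + 1
    return total


def count_calls(func, arg, name):
    """Run func(arg) under cProfile and count recorded calls to `name`."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(arg)
    finally:
        profiler.disable()
    stats = pstats.Stats(profiler).stats
    # Keys are (filename, line, function name); entry[1] is the total call count.
    return sum(entry[1] for key, entry in stats.items() if key[2] == name)


calls_outside = count_calls(loop_outside, 1000, "add_one")
calls_inside = count_calls(loop_inside, 1000, "loop_inside")
print(calls_outside, calls_inside)
```

With the loop outside, any per-call profiling overhead is paid once per iteration; with the loop inside, it is paid once.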

Note: All of this is (a bit of) speculation; I haven't actually built numba with debug symbols and profiled the C code while a profiler is running. However, the number of operations performed when a profiler is running makes this seem very plausible. All of this assumes numba 0.39; I'm not sure whether it applies to past versions as well.
