Remainder function (%) runtime on numpy arrays is far longer than manual remainder calculation

礼貌的吻别 2021-01-31 08:51

In the past few days I've been working on improving the runtime of a Python function which requires many uses of the remainder function (%) among other things. My main test cas

1 answer
  • 2021-01-31 09:32

    My best hypothesis is that your NumPy install is using an unoptimized fmod inside the % calculation. Here's why.


    First, I can't reproduce your results on a normal pip-installed version of NumPy 1.15.1. I get only about a 10% performance difference (asdf.py contains your timing code):

    $ python3.6 asdf.py
    0.0006543657302856445
    0.0006025806903839111
    

    I can reproduce a major performance discrepancy with a manual build (python3.6 setup.py build_ext --inplace -j 4) of v1.15.1 from a clone of the NumPy Git repository, though:

    $ python3.6 asdf.py
    0.00242799973487854
    0.0006397026300430298
    

    This suggests that the % in my pip-installed build is better optimized than the one in my manual build or in whatever you have installed.


    Looking under the hood, it's tempting to look at the implementation of floating-point % in NumPy and blame the slowdown on the unnecessary floordiv calculation (npy_divmod@c@ calculates both // and %):

    NPY_NO_EXPORT void
    @TYPE@_remainder(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(func))
    {
        BINARY_LOOP {
            const @type@ in1 = *(@type@ *)ip1;
            const @type@ in2 = *(@type@ *)ip2;
            npy_divmod@c@(in1, in2, (@type@ *)op1);
        }
    }
    

    but in my experiments, removing the floordiv provided no benefit. It looks easy enough for a compiler to optimize out, so maybe it was optimized out, or maybe it was just a negligible fraction of the runtime in the first place.

    Rather than the floordiv, let's focus on just one line in npy_divmod@c@, the fmod call:

    mod = npy_fmod@c@(a, b);
    

    This is the initial remainder computation, before special-case handling and adjusting the result to match the sign of the right-hand operand. Here is how % compares with numpy.fmod and with the manual calculation on my manual build:

    >>> import timeit
    >>> import numpy
    >>> a = numpy.arange(1, 8000, dtype=float)
    >>> timeit.timeit('a % 3', globals=globals(), number=1000)
    0.3510419335216284
    >>> timeit.timeit('numpy.fmod(a, 3)', globals=globals(), number=1000)
    0.33593094255775213
    >>> timeit.timeit('a - 3*numpy.floor(a/3)', globals=globals(), number=1000)
    0.07980139832943678
    

    fmod alone takes about 0.336 s of the 0.351 s that % takes, so it appears to be responsible for almost the entire runtime of %. The manual calculation takes only about 0.080 s.
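    To make the "adjusting the result to match the sign of the right-hand operand" step concrete, here is a simplified pure-Python sketch of how a floored remainder can be layered on top of a C-style fmod. It uses math.fmod rather than NumPy's C code, floored_mod is a hypothetical name, and the special-case handling (zero divisor, infinities, NaN) is left out.

```python
import math


def floored_mod(a, b):
    """Python-style remainder built on C-style fmod (illustrative helper, not NumPy's code)."""
    # fmod returns a result with the sign of the dividend a.
    mod = math.fmod(a, b)
    if mod != 0:
        # Shift a nonzero result so that it takes the sign of the divisor b.
        if (mod < 0) != (b < 0):
            mod += b
    else:
        # A zero result also takes the sign of the divisor.
        mod = math.copysign(0.0, b)
    return mod


if __name__ == "__main__":
    for a, b in [(7.0, 3.0), (-7.0, 3.0), (7.0, -3.0), (-6.0, 3.0)]:
        print(a, b, math.fmod(a, b), floored_mod(a, b), a % b)
```

    The adjustment is a comparison and at most one addition per element, so it would be surprising if it, rather than fmod itself, accounted for the difference in the timings.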


    I haven't disassembled the generated binary or stepped through it in an instruction-level debugger to see exactly what gets executed, and of course, I don't have access to your machine or your copy of NumPy. Still, from the above evidence, fmod seems like a pretty likely culprit.
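    If you want to check whether your own install shows the same pattern, a sketch along these lines may help. It is not the original asdf.py timing code; compare_remainders and its parameters are hypothetical names, and the array size and repeat count are simply copied from the session above.

```python
import timeit

import numpy as np


def compare_remainders(n=8000, divisor=3, number=1000):
    """Time three ways of computing a remainder on the same float array."""
    a = np.arange(1, n, dtype=float)
    candidates = {
        "a % d": lambda: a % divisor,
        "np.fmod(a, d)": lambda: np.fmod(a, divisor),
        "a - d*floor(a/d)": lambda: a - divisor * np.floor(a / divisor),
    }
    # For positive inputs and a positive divisor, all three should agree.
    reference = a % divisor
    timings = {}
    for label, fn in candidates.items():
        assert np.allclose(fn(), reference)
        timings[label] = timeit.timeit(fn, number=number)
    return timings


if __name__ == "__main__":
    print("NumPy", np.__version__)
    for label, seconds in compare_remainders().items():
        print(f"{label:>18}: {seconds:.4f} s")
```

    If the first two timings are close to each other and both much larger than the third, that would be consistent with the fmod hypothesis; if all three are similar, your build probably does not have this problem.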
