In the past few days I've been working on improving the runtime of a python function which requires many uses of the remainder function (%) among other things. My main test cas
My best hypothesis is that your NumPy install is using an unoptimized fmod inside the % calculation. Here's why.
First, I can't reproduce your results on a normal pip-installed version of NumPy 1.15.1. I get only about a 10% performance difference (asdf.py contains your timing code):
$ python3.6 asdf.py
0.0006543657302856445
0.0006025806903839111
I can reproduce a major performance discrepancy with a manual build (python3.6 setup.py build_ext --inplace -j 4) of v1.15.1 from a clone of the NumPy Git repository, though:
$ python3.6 asdf.py
0.00242799973487854
0.0006397026300430298
This suggests that my pip-installed build's % is better optimized than my manual build or what you have installed.
Looking under the hood, it's tempting to look at the implementation of floating-point % in NumPy and blame the slowdown on the unnecessary floordiv calculation (npy_divmod@c@ calculates both // and %):
NPY_NO_EXPORT void
@TYPE@_remainder(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(func))
{
    BINARY_LOOP {
        const @type@ in1 = *(@type@ *)ip1;
        const @type@ in2 = *(@type@ *)ip2;
        npy_divmod@c@(in1, in2, (@type@ *)op1);
    }
}
but in my experiments, removing the floordiv provided no benefit. It looks easy enough for a compiler to optimize out, so maybe it was optimized out, or maybe it was just a negligible fraction of the runtime in the first place.
Rather than the floordiv, let's focus on just one line in npy_divmod@c@, the fmod call:
mod = npy_fmod@c@(a, b);
This is the initial remainder computation, before special-case handling and adjusting the result to match the sign of the right-hand operand. If we compare the performance of % with numpy.fmod on my manual build:
>>> import timeit
>>> import numpy
>>> a = numpy.arange(1, 8000, dtype=float)
>>> timeit.timeit('a % 3', globals=globals(), number=1000)
0.3510419335216284
>>> timeit.timeit('numpy.fmod(a, 3)', globals=globals(), number=1000)
0.33593094255775213
>>> timeit.timeit('a - 3*numpy.floor(a/3)', globals=globals(), number=1000)
0.07980139832943678
We see that fmod appears to be responsible for almost the entire runtime of %.
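To make the relationship between fmod and % concrete, here is a rough pure-Python sketch of the idea described above: take the C-style fmod result, then adjust it so its sign matches the right-hand operand. This is a simplified illustration, not NumPy's actual code. The name remainder_via_fmod is hypothetical, and it assumes finite inputs and a nonzero divisor (math.fmod raises ValueError for a zero divisor, and NumPy's special-case handling isn't reproduced here).

```python
import math


def remainder_via_fmod(a: float, b: float) -> float:
    """Hypothetical helper: Python-style a % b built on top of math.fmod.

    Assumes a and b are finite and b != 0.
    """
    # C-style remainder: the result takes the sign of the dividend a.
    mod = math.fmod(a, b)
    if mod != 0.0:
        # Shift into the divisor's sign range if the signs disagree.
        if (b < 0) != (mod < 0):
            mod += b
    else:
        # A zero remainder takes the sign of the divisor.
        mod = math.copysign(0.0, b)
    return mod


print(math.fmod(-7.0, 3.0))           # -1.0
print(remainder_via_fmod(-7.0, 3.0))  # 2.0

# The adjusted result agrees with Python's own float % on these samples.
for a, b in [(-7.0, 3.0), (7.5, -2.0), (6.0, 3.0), (-6.0, 3.0)]:
    assert remainder_via_fmod(a, b) == a % b
```

The adjustment is only a comparison and an addition per element, which fits the timings above: if fmod itself is slow, % inherits nearly all of that cost.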
I haven't disassembled the generated binary or stepped through it in an instruction-level debugger to see exactly what gets executed, and of course, I don't have access to your machine or your copy of NumPy. Still, from the above evidence, fmod seems like a pretty likely culprit.
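If fmod does turn out to be the bottleneck on your build, the floor-based expression from the timings above can be wrapped up and checked against % before you rely on it. This is only a sketch: floor_remainder is a hypothetical name, and the check covers just the test array used above. Because a / b is rounded before the floor is taken, the result may differ from % for very large magnitudes or for special values (zero divisor, infinities, NaN), so verify it on your own data.

```python
import numpy


def floor_remainder(a, b):
    """Hypothetical helper: remainder computed as a - b*floor(a/b).

    Avoids calling fmod, but is not guaranteed to match % in every
    edge case, since a / b is rounded before the floor is taken.
    """
    return a - b * numpy.floor(a / b)


# Same test array as in the timings above.
a = numpy.arange(1, 8000, dtype=float)

# Confirm the workaround agrees with % on this input before using it.
assert numpy.allclose(floor_remainder(a, 3), a % 3)
```

Whether this is actually faster depends on the build. On a build where % is already well optimized, the extra temporaries may make it slower, so time both on your machine.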