Remainder function (%) runtime on numpy arrays is far longer than manual remainder calculation


In the past few days I've been working on improving the runtime of a Python function which requires many uses of the remainder function (%) among other things. My main test cas


1 Answer

孤独总比滥情好

2021-01-31 09:32
My best hypothesis is that your NumPy install is using an unoptimized fmod inside the % calculation. Here's why.

First, I can't reproduce your results on a normal pip-installed version of NumPy 1.15.1. I get only about a 10% performance difference (asdf.py contains your timing code):
```
$ python3.6 asdf.py
0.0006543657302856445
0.0006025806903839111
```
I can reproduce a major performance discrepancy with a manual build (python3.6 setup.py build_ext --inplace -j 4) of v1.15.1 from a clone of the NumPy Git repository, though:
```
$ python3.6 asdf.py
0.00242799973487854
0.0006397026300430298
```
This suggests that my pip-installed build's % is better optimized than my manual build or what you have installed.

Looking under the hood, it's tempting to look at the implementation of floating-point % in NumPy and blame the slowdown on the unnecessary floordiv calculation (npy_divmod@c@ calculates both // and %):
```
NPY_NO_EXPORT void
@TYPE@_remainder(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(func))
{
    BINARY_LOOP {
        const @type@ in1 = *(@type@ *)ip1;
        const @type@ in2 = *(@type@ *)ip2;
        npy_divmod@c@(in1, in2, (@type@ *)op1);
    }
}
```
but in my experiments, removing the floordiv provided no benefit. It looks easy enough for a compiler to optimize out, so maybe it was optimized out, or maybe it was just a negligible fraction of the runtime in the first place.
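For reference, the same "both results from one call" relationship is visible at the Python level through numpy.divmod. This is only a small sketch of the semantics (it says nothing about what the compiler does with the C loop above), and the array is the same one used in the timings in this answer:

```python
import numpy as np

a = np.arange(1, 8000, dtype=float)

# numpy.divmod returns the floor quotient and the remainder together,
# i.e. the same two values that // and % give separately.
q, r = np.divmod(a, 3)

print(np.array_equal(q, a // 3))
print(np.array_equal(r, a % 3))
```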

Rather than the floordiv, let's focus on just one line in npy_divmod@c@, the fmod call:
```
mod = npy_fmod@c@(a, b);
```
This is the initial remainder computation, before special-case handling and adjusting the result to match the sign of the right-hand operand. If we compare the performance of % with numpy.fmod on my manual build:
```
>>> import timeit
>>> import numpy
>>> a = numpy.arange(1, 8000, dtype=float)
>>> timeit.timeit('a % 3', globals=globals(), number=1000)
0.3510419335216284
>>> timeit.timeit('numpy.fmod(a, 3)', globals=globals(), number=1000)
0.33593094255775213
>>> timeit.timeit('a - 3*numpy.floor(a/3)', globals=globals(), number=1000)
0.07980139832943678
```
We see that fmod accounts for about 0.34 s of the 0.35 s that % takes, so it appears to be responsible for almost the entire runtime of %.
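To make the relationship between the two operations concrete, here is a rough NumPy sketch of the sign adjustment described above. The helper name remainder_via_fmod is made up for illustration, it assumes finite inputs and a nonzero divisor, and it is not NumPy's actual C implementation (which also handles special cases that this skips):

```python
import numpy as np

def remainder_via_fmod(a, b):
    """Hypothetical helper: floor-style remainder built on top of fmod.

    Assumes finite inputs and b != 0; special cases are not handled.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mod = np.fmod(a, b)  # C-style result: takes the sign of a
    # Where the result is nonzero and its sign differs from b's, shift it by b
    # so that the result takes the sign of the right-hand operand instead.
    needs_fix = (mod != 0) & ((mod < 0) != (b < 0))
    return np.where(needs_fix, mod + b, mod)

a = np.array([5.5, -5.5, 5.5, -5.5, 6.0])
b = np.array([3.0, 3.0, -3.0, -3.0, 3.0])

print(np.fmod(a, b))             # sign follows a
print(remainder_via_fmod(a, b))  # sign follows b
print(a % b)                     # NumPy's own %, for comparison
```

The extra work on top of fmod is a comparison and a conditional add per element, which is consistent with fmod itself dominating the cost.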

I haven't disassembled the generated binary or stepped through it in an instruction-level debugger to see exactly what gets executed, and of course, I don't have access to your machine or your copy of NumPy. Still, from the above evidence, fmod seems like a pretty likely culprit.
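If you want to check this on your own install, something along these lines should do. It is a sketch that repeats the three timings above in script form; the function name time_remainder_variants and its default parameters are my own choices, not anything from your code:

```python
import timeit
import numpy as np

def time_remainder_variants(n=8000, divisor=3, number=200):
    """Hypothetical helper: seconds taken by %, numpy.fmod, and the manual formula."""
    a = np.arange(1, n, dtype=float)
    return {
        "a % d": timeit.timeit(lambda: a % divisor, number=number),
        "numpy.fmod(a, d)": timeit.timeit(lambda: np.fmod(a, divisor), number=number),
        "a - d*floor(a/d)": timeit.timeit(
            lambda: a - divisor * np.floor(a / divisor), number=number
        ),
    }

print("NumPy version:", np.__version__)
for label, seconds in time_remainder_variants().items():
    print(f"{label:>18}: {seconds:.4f} s")
```

If the first two numbers are close to each other and both are far above the third, that would point at fmod on your build as well.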