How to avoid less precise sum for numpy-arrays with multiple columns


I don't really have an explanation, but it seems related to the memory layout: using Fortran order instead of the default C order, I get the desired output.

>>> import numpy as np
>>> N = 17000000  # so that the exact sum, 17000000, exceeds 2**24 = 16777216

>>> np.ones((N,2), dtype=np.float32, order='C').sum(axis=0)
array([16777216., 16777216.], dtype=float32)

>>> np.ones((N,2), dtype=np.float32, order='F').sum(axis=0)
array([17000000., 17000000.], dtype=float32)

This behavior is due to the way numpy accesses memory during a reduce operation ("add" being just a special case), in order to improve cache utilization.

For some cases (like the one above), pairwise summation could be enforced without a big impact on performance. But in general, enforcing it would lead to a massive performance loss; it might be easier to use double precision, which mitigates the above issue in most cases.
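For example, asking sum to accumulate in double precision via its dtype argument already yields the exact result for the C-ordered array above:

>>> np.ones((N,2), dtype=np.float32, order='C').sum(axis=0, dtype=np.float64)
array([17000000., 17000000.])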


Pairwise summation can be seen as a very specific optimization for the "add" operation, which is applied when certain constraints (more on this later) are met.

Summation (and many other reduce operations) is memory-bandwidth bound. Life is good if we sum along a contiguous axis: the memory fetched into the cache for index i is directly reused for the calculations with indices i+1, i+2, ... without being evicted from the cache before being used.

The situation is different when the summation isn't along a contiguous axis: to add a single float32 element, 16 float32s (one 64-byte cache line) are fetched into the cache, but 15 of them are evicted before they can be used and must be fetched again - what a waste.
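The effect can be seen in isolation by summing the same number of elements once through a strided view and once contiguously (a sketch - the exact timings depend on the machine and its cache sizes):

a = np.ones(16 * 10**6, dtype=np.float32)  # 64 MB, larger than typical caches

%timeit a[::16].sum()    # strided: one element used per fetched cache line
%timeit a[:10**6].sum()  # contiguous: same element count, full cache-line reuse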

That is the reason why numpy performs the summation row-wise in this case: it sums the first and second rows, then adds the third row to the result, then the fourth, and so on. However, pairwise summation is only implemented for one-dimensional summation and cannot be used here.
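For illustration, here is a minimal Python sketch of the pairwise idea. It is not numpy's actual implementation (which additionally uses an unrolled block loop for small sizes), but it shows why the rounding error grows like O(log n) instead of O(n): partial sums are always built from operands of similar magnitude.

def pairwise_sum(a, block=8):
    """Sketch of pairwise summation for a 1-D float32 array:
    recursively split the array in half and add the partial sums."""
    n = len(a)
    if n <= block:
        # small base case: plain sequential accumulation
        s = np.float32(0.0)
        for x in a:
            s = np.float32(s + x)
        return s
    mid = n // 2
    return np.float32(pairwise_sum(a[:mid], block) + pairwise_sum(a[mid:], block))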

Pairwise summation is performed when:

  • sum is called on a one-dimensional numpy array
  • sum is called along a contiguous axis

numpy doesn't (yet?) offer a way to enforce pairwise summation without a major negative impact on performance.

My takeaway: the goal should be to perform the summation along a contiguous axis, which is not only more precise but can also be much faster:

A = np.ones((N,2), dtype=np.float32, order="C")  # summation axis is non-contiguous
%timeit A.sum(axis=0)
# 326 ms ± 9.17 ms

B = np.ones((N,2), dtype=np.float32, order="F")  # summation axis is contiguous
%timeit B.sum(axis=0)
# 15.6 ms ± 898 µs

In this special case, with only 2 elements per row, the overhead is just too big (see also the similar behavior explained here).

One can do better, for example via the (still imprecise) einsum:

%timeit np.einsum("i...->...", A)
# 74.5 ms ± 1.47 ms 
np.einsum("i...->...", A)
# array([16777216.,  16777216.], dtype=float32)

or even:

%timeit np.array([A[:,0].sum(), A[:,1].sum()], dtype=np.float32)
# 17.8 ms ± 333 µs 
np.array([A[:,0].sum(), A[:,1].sum()], dtype=np.float32)
# array([17000000., 17000000.], dtype=float32)

which is not only almost as fast as the contiguous version (the penalty for loading the memory twice isn't as high as for loading it 16 times), but also precise, because sum is applied to one-dimensional numpy arrays.

For more columns, the difference to the contiguous case is much smaller, both for numpy's way and for the einsum way:

B=np.ones((N,16), dtype=np.float32, order="F")
%timeit B.sum(axis=0)
# 121 ms ± 3.66 ms 

A=np.ones((N,16), dtype=np.float32, order="C")
%timeit A.sum(axis=0)
# 457 ms ± 12.1 ms 

%timeit np.einsum("i...->...", A)
# 139 ms ± 651 µs per loop

But the performance of the "precise" trick is very bad here, probably because memory latency can no longer be hidden by computation:

def do(A):
    # precise but slow: one separate 1-D (hence pairwise) sum per column
    N = A.shape[1]
    res = np.zeros(N, dtype=np.float32)
    for i in range(N):
        res[i] = A[:,i].sum()
    return res
%timeit do(A)
# 1.39 s ± 47.8 ms

Here are the gory details of numpy's implementation.

The difference can be seen in the code of FLOAT_add, together with the macros it uses (both from numpy's template file numpy/core/src/umath/loops.c.src):

#define IS_BINARY_REDUCE ((args[0] == args[2])\
    && (steps[0] == steps[2])\
    && (steps[0] == 0))

#define BINARY_REDUCE_LOOP_INNER\
    char *ip2 = args[1]; \
    npy_intp is2 = steps[1]; \
    npy_intp n = dimensions[0]; \
    npy_intp i; \
    for(i = 0; i < n; i++, ip2 += is2)

#define BINARY_REDUCE_LOOP(TYPE)\
    char *iop1 = args[0]; \
    TYPE io1 = *(TYPE *)iop1; \
    BINARY_REDUCE_LOOP_INNER

/** (ip1, ip2) -> (op1) */
#define BINARY_LOOP\
    char *ip1 = args[0], *ip2 = args[1], *op1 = args[2];\
    npy_intp is1 = steps[0], is2 = steps[1], os1 = steps[2];\
    npy_intp n = dimensions[0];\
    npy_intp i;\
    for(i = 0; i < n; i++, ip1 += is1, ip2 += is2, op1 += os1)

/**begin repeat
* Float types
*  #type = npy_float, npy_double, npy_longdouble#
*  #TYPE = FLOAT, DOUBLE, LONGDOUBLE#
*  #c = f, , l#
*  #C = F, , L#
*/

/**begin repeat1
 * Arithmetic
 * # kind = add, subtract, multiply, divide#
 * # OP = +, -, *, /#
 * # PW = 1, 0, 0, 0#
 */
NPY_NO_EXPORT void
@TYPE@_@kind@(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(func))
{
    if (IS_BINARY_REDUCE) {
#if @PW@
        @type@ * iop1 = (@type@ *)args[0];
        npy_intp n = dimensions[0];

        *iop1 @OP@= pairwise_sum_@TYPE@(args[1], n, steps[1]);
#else
        BINARY_REDUCE_LOOP(@type@) {
            io1 @OP@= *(@type@ *)ip2;
        }
        *((@type@ *)iop1) = io1;
#endif
    }
    else if (!run_binary_simd_@kind@_@TYPE@(args, dimensions, steps)) {
        BINARY_LOOP {
            const @type@ in1 = *(@type@ *)ip1;
            const @type@ in2 = *(@type@ *)ip2;
            *((@type@ *)op1) = in1 @OP@ in2;
        }
    }
}

which, once generated, looks as follows:

NPY_NO_EXPORT void
FLOAT_add(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(func))
{
    if (IS_BINARY_REDUCE) {
#if 1
        npy_float * iop1 = (npy_float *)args[0];
        npy_intp n = dimensions[0];

        *iop1 += pairwise_sum_FLOAT((npy_float *)args[1], n,
                                        steps[1] / (npy_intp)sizeof(npy_float));
#else
        BINARY_REDUCE_LOOP(npy_float) {
            io1 += *(npy_float *)ip2;
        }
        *((npy_float *)iop1) = io1;
#endif
    }
    else if (!run_binary_simd_add_FLOAT(args, dimensions, steps)) {
        BINARY_LOOP {
            const npy_float in1 = *(npy_float *)ip1;
            const npy_float in2 = *(npy_float *)ip2;
            *((npy_float *)op1) = in1 + in2;
        }
    }
}

FLOAT_add can be used for a one-dimensional reduction; in this case:

  • args[0] is the pointer to the result/initial value (and is the same as args[2])
  • args[1] is the pointer to the input array
  • steps[0] and steps[2] are 0, i.e. args[0] and args[2] point to a scalar

and then pairwise summation can be used (checked via IS_BINARY_REDUCE).

FLOAT_add can also be used to add two vectors element-wise; in this case:

  • args[0] points to the first input array
  • args[1] points to the second input array
  • args[2] points to the output array
  • steps[0], steps[1] and steps[2] are the strides from one element to the next in the respective arrays

The template parameter @PW@ is 1 only for addition - for all other operations, pairwise summation isn't used.
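The consequence is easy to check from Python: a one-dimensional float32 reduction takes the IS_BINARY_REDUCE branch above with @PW@ = 1 and stays exact, while a strictly sequential accumulation - here via cumsum, whose running totals are by definition built one element at a time - gets stuck at 2**24:

>>> a = np.ones(2**24 + 16, dtype=np.float32)

>>> np.add.reduce(a)  # pairwise: exact
16777232.0

>>> np.cumsum(a)[-1]  # sequential running sum: saturates at 2**24
16777216.0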
