How to avoid less precise sum for numpy-arrays with multiple columns


I don't really have an explanation, but it seems related to the memory layout: using Fortran order instead of the default C order, I get the desired output.

>>> import numpy as np
>>> N = 17000000  # so that the exact sum, 17000000, exceeds 2**24 = 16777216

>>> np.ones((N,2), dtype=np.float32, order='C').sum(axis=0)
array([16777216., 16777216.], dtype=float32)

>>> np.ones((N,2), dtype=np.float32, order='F').sum(axis=0)
array([17000000., 17000000.], dtype=float32)

This behavior is due to the way numpy accesses memory during a reduce operation ("add" being just a special case), in order to improve cache utilization.

For some cases (like the one above), pairwise summation could be enforced without a big impact on performance. But in general, enforcing it would lead to a massive performance loss; it might be easier to use double precision, which mitigates the above issue in most cases.
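For example, asking sum to accumulate in double precision via its dtype argument already yields the exact result for the C-ordered array above:

>>> np.ones((N,2), dtype=np.float32, order='C').sum(axis=0, dtype=np.float64)
array([17000000., 17000000.])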


Pairwise summation can be seen as a very specific optimization for the "add" operation, which is applied when certain constraints (more on this later) are met.

Summation (and many other reduce operations) is memory-bandwidth bound. Life is good if we sum along a contiguous axis: the memory fetched into the cache for index i is directly reused for the calculations with indices i+1, i+2, ... without being evicted from the cache before being used.

The situation is different when the summation isn't along a contiguous axis: to add a single float32 element, 16 float32s (one 64-byte cache line) are fetched into the cache, but 15 of them are evicted before they can be used and must be fetched again - what a waste.
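The effect can be seen in isolation by summing the same number of elements once through a strided view and once contiguously (a sketch - the exact timings depend on the machine and its cache sizes):

a = np.ones(16 * 10**6, dtype=np.float32)  # 64 MB, larger than typical caches

%timeit a[::16].sum()    # strided: one element used per fetched cache line
%timeit a[:10**6].sum()  # contiguous: same element count, full cache-line reuse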

That is the reason why numpy performs the summation row-wise in this case: it sums the first and second rows, then adds the third row to the result, then the fourth, and so on. However, pairwise summation is only implemented for one-dimensional summation and cannot be used here.
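For illustration, here is a minimal Python sketch of the pairwise idea. It is not numpy's actual implementation (which additionally uses an unrolled block loop for small sizes), but it shows why the rounding error grows like O(log n) instead of O(n): partial sums are always built from operands of similar magnitude.

def pairwise_sum(a, block=8):
    """Sketch of pairwise summation for a 1-D float32 array:
    recursively split the array in half and add the partial sums."""
    n = len(a)
    if n <= block:
        # small base case: plain sequential accumulation
        s = np.float32(0.0)
        for x in a:
            s = np.float32(s + x)
        return s
    mid = n // 2
    return np.float32(pairwise_sum(a[:mid], block) + pairwise_sum(a[mid:], block))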

Pairwise summation is performed when:

  • sum is called on a one-dimensional numpy array
  • sum is called along a contiguous axis

numpy doesn't (yet?) offer a way to enforce pairwise summation without a major negative impact on performance.

My takeaway: the goal should be to perform the summation along a contiguous axis, which is not only more precise but can also be much faster:

A = np.ones((N,2), dtype=np.float32, order="C")  # summation axis is non-contiguous
%timeit A.sum(axis=0)
# 326 ms ± 9.17 ms

B = np.ones((N,2), dtype=np.float32, order="F")  # summation axis is contiguous
%timeit B.sum(axis=0)
# 15.6 ms ± 898 µs

In this special case, with only 2 elements per row, the overhead is just too big (see also the similar behavior explained here).

One can do better, for example via the (still imprecise) einsum:

%timeit np.einsum("i...->...", A)
# 74.5 ms ± 1.47 ms 
np.einsum("i...->...", A)
# array([16777216.,  16777216.], dtype=float32)

or even:

%timeit np.array([A[:,0].sum(), A[:,1].sum()], dtype=np.float32)
# 17.8 ms ± 333 µs 
np.array([A[:,0].sum(), A[:,1].sum()], dtype=np.float32)
# array([17000000., 17000000.], dtype=float32)

which is not only almost as fast as the contiguous version (the penalty for loading the memory twice isn't as high as for loading it 16 times), but also precise, because sum is applied to one-dimensional numpy arrays.

For more columns, the difference to the contiguous case is much smaller, both for numpy's way and for the einsum way:

B=np.ones((N,16), dtype=np.float32, order="F")
%timeit B.sum(axis=0)
# 121 ms ± 3.66 ms 

A=np.ones((N,16), dtype=np.float32, order="C")
%timeit A.sum(axis=0)
# 457 ms ± 12.1 ms 

%timeit np.einsum("i...->...", A)
# 139 ms ± 651 µs per loop

But the performance of the "precise" trick is very bad here, probably because memory latency can no longer be hidden by computation:

def do(A):
    # precise but slow: one separate 1-D (hence pairwise) sum per column
    N = A.shape[1]
    res = np.zeros(N, dtype=np.float32)
    for i in range(N):
        res[i] = A[:,i].sum()
    return res
%timeit do(A)
# 1.39 s ± 47.8 ms

Here are the gory details of numpy's implementation.

The difference can be seen in the code of FLOAT_add, together with the macros it uses (both from numpy's template file numpy/core/src/umath/loops.c.src):

#define IS_BINARY_REDUCE ((args[0] == args[2])\
    && (steps[0] == steps[2])\
    && (steps[0] == 0))

#define BINARY_REDUCE_LOOP_INNER\
    char *ip2 = args[1]; \
    npy_intp is2 = steps[1]; \
    npy_intp n = dimensions[0]; \
    npy_intp i; \
    for(i = 0; i < n; i++, ip2 += is2)

#define BINARY_REDUCE_LOOP(TYPE)\
    char *iop1 = args[0]; \
    TYPE io1 = *(TYPE *)iop1; \
    BINARY_REDUCE_LOOP_INNER

/** (ip1, ip2) -> (op1) */
#define BINARY_LOOP\
    char *ip1 = args[0], *ip2 = args[1], *op1 = args[2];\
    npy_intp is1 = steps[0], is2 = steps[1], os1 = steps[2];\
    npy_intp n = dimensions[0];\
    npy_intp i;\
    for(i = 0; i < n; i++, ip1 += is1, ip2 += is2, op1 += os1)

/**begin repeat
* Float types
*  #type = npy_float, npy_double, npy_longdouble#
*  #TYPE = FLOAT, DOUBLE, LONGDOUBLE#
*  #c = f, , l#
*  #C = F, , L#
*/

/**begin repeat1
 * Arithmetic
 * # kind = add, subtract, multiply, divide#
 * # OP = +, -, *, /#
 * # PW = 1, 0, 0, 0#
 */
NPY_NO_EXPORT void
@TYPE@_@kind@(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(func))
{
    if (IS_BINARY_REDUCE) {
#if @PW@
        @type@ * iop1 = (@type@ *)args[0];
        npy_intp n = dimensions[0];

        *iop1 @OP@= pairwise_sum_@TYPE@(args[1], n, steps[1]);
#else
        BINARY_REDUCE_LOOP(@type@) {
            io1 @OP@= *(@type@ *)ip2;
        }
        *((@type@ *)iop1) = io1;
#endif
    }
    else if (!run_binary_simd_@kind@_@TYPE@(args, dimensions, steps)) {
        BINARY_LOOP {
            const @type@ in1 = *(@type@ *)ip1;
            const @type@ in2 = *(@type@ *)ip2;
            *((@type@ *)op1) = in1 @OP@ in2;
        }
    }
}

which, once generated, looks as follows:

NPY_NO_EXPORT void
FLOAT_add(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(func))
{
    if (IS_BINARY_REDUCE) {
#if 1
        npy_float * iop1 = (npy_float *)args[0];
        npy_intp n = dimensions[0];

        *iop1 += pairwise_sum_FLOAT((npy_float *)args[1], n,
                                        steps[1] / (npy_intp)sizeof(npy_float));
#else
        BINARY_REDUCE_LOOP(npy_float) {
            io1 += *(npy_float *)ip2;
        }
        *((npy_float *)iop1) = io1;
#endif
    }
    else if (!run_binary_simd_add_FLOAT(args, dimensions, steps)) {
        BINARY_LOOP {
            const npy_float in1 = *(npy_float *)ip1;
            const npy_float in2 = *(npy_float *)ip2;
            *((npy_float *)op1) = in1 + in2;
        }
    }
}

FLOAT_add can be used for a one-dimensional reduction; in this case:

  • args[0] is the pointer to the result/initial value (and is the same as args[2])
  • args[1] is the pointer to the input array
  • steps[0] and steps[2] are 0, i.e. args[0] and args[2] point to a scalar

and then pairwise summation can be used (checked via IS_BINARY_REDUCE).

FLOAT_add can also be used to add two vectors element-wise; in this case:

  • args[0] points to the first input array
  • args[1] points to the second input array
  • args[2] points to the output array
  • steps[0], steps[1] and steps[2] are the strides from one element to the next in the respective arrays

The template parameter @PW@ is 1 only for addition - for all other operations, pairwise summation isn't used.
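The consequence is easy to check from Python: a one-dimensional float32 reduction takes the IS_BINARY_REDUCE branch above with @PW@ = 1 and stays exact, while a strictly sequential accumulation - here via cumsum, whose running totals are by definition built one element at a time - gets stuck at 2**24:

>>> a = np.ones(2**24 + 16, dtype=np.float32)

>>> np.add.reduce(a)  # pairwise: exact
16777232.0

>>> np.cumsum(a)[-1]  # sequential running sum: saturates at 2**24
16777216.0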
