I've always assumed that numpy uses a kind of pairwise summation, which ensures high precision also for float32 operations:
import numpy as np
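For concreteness, the effect in question (with N = 17 * 10**6, the value matching the outputs quoted in the answers below; it lies just above 2**24 = 16777216, the point where float32 can no longer resolve increments of 1) looks like this:
N = 17 * 10**6
print(np.ones(N, dtype=np.float32).sum())
# 17000000.0 - precise, as hoped
print(np.ones((N, 2), dtype=np.float32).sum(axis=0))
# [16777216. 16777216.] - stuck at 2**24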
This behavior is due to the way numpy accesses memory during a reduce operation ("add" being just a special case), in order to improve cache utilization.
In some cases (like the one above), pairwise summation could be enforced without a big impact on performance. But in general, enforcing it would lead to a massive performance loss; it might be easier to use double precision, which mitigates the above issue in most cases.
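For example, the data can stay float32 while the accumulation is done in float64, via the dtype argument of sum, which selects the accumulator type:
A = np.ones((N, 2), dtype=np.float32)
A.sum(axis=0, dtype=np.float64)  # float64 accumulator avoids the 2**24 saturation
# array([17000000., 17000000.])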
Pairwise summation can be seen as a very specific optimization for the "add" operation, which is applied only if certain constraints (more on this later) are met.
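To make the idea concrete, here is a bare-bones sketch of pairwise summation (a simplification; numpy's real pairwise_sum_FLOAT works block-wise with unrolling, but the error behavior is similar in spirit): adjacent pairs are added repeatedly, so every element passes through only about log2(n) roundings instead of up to n:
def pairwise_sum32(x):
    # simplified pairwise summation, keeping everything in float32
    x = np.asarray(x, dtype=np.float32).copy()
    while x.size > 1:
        if x.size % 2:  # odd length: fold the last element into the first
            x[0] += x[-1]
            x = x[:-1]
        x = x[0::2] + x[1::2]  # float32 + float32 stays float32
    return x[0]

pairwise_sum32(np.ones(N, dtype=np.float32))
# 17000000.0 - a naive float32 loop would get stuck at 16777216.0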
Summation (and many other reduce operations) is memory-bandwidth bound. Life is good if we sum along a contiguous axis: the memory fetched into the cache for index i is directly reused for the calculations at indices i+1, i+2, ... without being evicted from the cache before it is used.
The situation is different when the summation isn't along a contiguous axis: to add one float32 element, 16 float32s (one 64-byte cache line) are fetched into the cache, but 15 of them are evicted before they can be used and must be fetched again; what a waste.
That is the reason why numpy sums row-wise in this case: it sums the first and second rows, then adds the third row to the result, then the fourth, and so on. However, pairwise summation is only implemented for one-dimensional summation and cannot be used here.
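The strides make this visible: in the C-ordered array, stepping along axis 0 jumps over the rest of the row in memory, while in the Fortran-ordered array the elements along axis 0 are adjacent:
np.ones((10, 2), dtype=np.float32, order="C").strides  # (8, 4): each step along axis 0 skips 8 bytes
np.ones((10, 2), dtype=np.float32, order="F").strides  # (4, 40): elements along axis 0 are adjacent 4-byte floats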
Pairwise summation is performed when:

- sum is called on a one-dimensional numpy array
- sum is called along a contiguous axis

numpy doesn't (yet?) offer a way to enforce pairwise summation without a major negative impact on performance.
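The first constraint can be watched at work by comparing sum with cumsum, which has to produce every intermediate partial sum and therefore accumulates sequentially:
a = np.ones(N, dtype=np.float32)
a.sum()         # 17000000.0 - pairwise summation
a.cumsum()[-1]  # 16777216.0 - sequential partial sums, stuck at 2**24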
My takeaway: the goal should be to perform the summation along the contiguous axis, which is not only more precise, but might also be much faster:
A=np.ones((N,2), dtype=np.float32, order="C") # summation along axis 0 is non-contiguous
%timeit A.sum(axis=0)
# 326 ms ± 9.17 ms
B=np.ones((N,2), dtype=np.float32, order="F") # summation along axis 0 is contiguous
%timeit B.sum(axis=0)
# 15.6 ms ± 898 µs
In this special case, with only 2 elements per row, the overhead of the non-contiguous access is just too big (see also the similar behavior explained here).
It can be done better, for example via the still imprecise einsum:
%timeit np.einsum("i...->...", A)
# 74.5 ms ± 1.47 ms
np.einsum("i...->...", A)
# array([16777216., 16777216.], dtype=float32)
or even:
%timeit np.array([A[:,0].sum(), A[:,1].sum()], dtype=np.float32)
# 17.8 ms ± 333 µs
np.array([A[:,0].sum(), A[:,1].sum()], dtype=np.float32)
# array([17000000., 17000000.], dtype=float32)
which is not only almost as fast as the contiguous version (the penalty of loading the memory twice isn't as high as loading it 16 times), but also precise, because sum is applied to one-dimensional numpy arrays.
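A further variant along the same lines (not benchmarked here) is to pay for a copy: converting to Fortran order first routes the summation through the contiguous, pairwise path:
np.asfortranarray(A).sum(axis=0)
# array([17000000., 17000000.], dtype=float32)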
For more columns, the difference from the contiguous case is much smaller for both numpy's way and the einsum way:
B=np.ones((N,16), dtype=np.float32, order="F")
%timeit B.sum(axis=0)
# 121 ms ± 3.66 ms
A=np.ones((N,16), dtype=np.float32, order="C")
%timeit A.sum(axis=0)
# 457 ms ± 12.1 ms
%timeit np.einsum("i...->...", A)
# 139 ms ± 651 µs
But the performance of the "precise" trick is very bad, probably because memory latency can no longer be hidden behind calculations:
def do(A):
    N=A.shape[1]
    res=np.zeros(N, dtype=np.float32)
    for i in range(N):
        res[i]=A[:,i].sum()
    return res
%timeit do(A)
# 1.39 s ± 47.8 ms
Here are the gory details of numpy's implementation.
The difference can be seen in the code of FLOAT_add, with defines from here:
#define IS_BINARY_REDUCE ((args[0] == args[2])\
        && (steps[0] == steps[2])\
        && (steps[0] == 0))

#define BINARY_REDUCE_LOOP(TYPE)\
    char *iop1 = args[0]; \
    TYPE io1 = *(TYPE *)iop1; \

/** (ip1, ip2) -> (op1) */
#define BINARY_LOOP\
    char *ip1 = args[0], *ip2 = args[1], *op1 = args[2];\
    npy_intp is1 = steps[0], is2 = steps[1], os1 = steps[2];\
    npy_intp n = dimensions[0];\
    npy_intp i;\
    for(i = 0; i < n; i++, ip1 += is1, ip2 += is2, op1 += os1)
/**begin repeat
* Float types
* #type = npy_float, npy_double, npy_longdouble#
* #TYPE = FLOAT, DOUBLE, LONGDOUBLE#
* #c = f, , l#
* #C = F, , L#
*/
/**begin repeat1
* Arithmetic
* # kind = add, subtract, multiply, divide#
* # OP = +, -, *, /#
* # PW = 1, 0, 0, 0#
*/
NPY_NO_EXPORT void
@TYPE@_@kind@(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(func))
{
    if (IS_BINARY_REDUCE) {
#if @PW@
        @type@ * iop1 = (@type@ *)args[0];
        npy_intp n = dimensions[0];
        *iop1 @OP@= pairwise_sum_@TYPE@(args[1], n, steps[1]);
#else
        BINARY_REDUCE_LOOP(@type@) {
            io1 @OP@= *(@type@ *)ip2;
        }
        *((@type@ *)iop1) = io1;
#endif
    }
    else if (!run_binary_simd_@kind@_@TYPE@(args, dimensions, steps)) {
        BINARY_LOOP {
            const @type@ in1 = *(@type@ *)ip1;
            const @type@ in2 = *(@type@ *)ip2;
            *((@type@ *)op1) = in1 @OP@ in2;
        }
    }
}
which, once generated, looks as follows:
NPY_NO_EXPORT void
FLOAT_add(char **args, npy_intp *dimensions, npy_intp *steps, void *NPY_UNUSED(func))
{
    if (IS_BINARY_REDUCE) {
#if 1
        npy_float * iop1 = (npy_float *)args[0];
        npy_intp n = dimensions[0];
        *iop1 += pairwise_sum_FLOAT((npy_float *)args[1], n,
                                    steps[1] / (npy_intp)sizeof(npy_float));
#else
        BINARY_REDUCE_LOOP(npy_float) {
            io1 += *(npy_float *)ip2;
        }
        *((npy_float *)iop1) = io1;
#endif
    }
    else if (!run_binary_simd_add_FLOAT(args, dimensions, steps)) {
        BINARY_LOOP {
            const npy_float in1 = *(npy_float *)ip1;
            const npy_float in2 = *(npy_float *)ip2;
            *((npy_float *)op1) = in1 + in2;
        }
    }
}
FLOAT_add can be used for one-dimensional reduction; in this case:

- args[0] is the pointer to the result/initial value (the same as args[2])
- args[1] is the input array
- steps[0] and steps[2] are 0, i.e. these pointers point to a scalar

and then pairwise summation can be used (checked via IS_BINARY_REDUCE).
FLOAT_add can also be used to add two vectors; in this case:

- args[0] is the first input array
- args[1] is the second input array
- args[2] is the output array
- steps are the steps from one element to the next in the above arrays

The parameter @PW@ is 1 only for summation; for all other operations, pairwise summation isn't used.
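A toy Python model of this dispatch (an illustration only, not numpy's actual code; the real steps are byte strides on char pointers, here they are element strides on float32 buffers) shows how the reduce case is recognized purely from the calling convention:
import numpy as np

def float_add(args, dimensions, steps):
    # args = (in1, in2, out); a reduce passes the accumulator
    # as both args[0] and args[2], with zero stride for it
    n = dimensions[0]
    if args[0] is args[2] and steps[0] == steps[2] == 0:
        # binary reduce; the real FLOAT_add calls pairwise_sum_FLOAT here
        acc = np.float32(args[0][0])
        for i in range(n):
            acc += args[1][i * steps[1]]
        args[0][0] = acc
    else:  # element-wise addition of two vectors
        in1, in2, out = args
        for i in range(n):
            out[i * steps[2]] = in1[i * steps[0]] + in2[i * steps[1]]

acc = np.zeros(1, dtype=np.float32)
data = np.ones(10, dtype=np.float32)
float_add((acc, data, acc), (10,), (0, 1, 0))  # the reduce convention
acc  # array([10.], dtype=float32)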
I don't really have an explanation, but it seems to be related to the memory layout. Using Fortran order instead of the default C order, I get the desired output.
>>> np.ones((N,2),dtype=np.float32, order='C').sum(axis=0)
array([16777216., 16777216.], dtype=float32)
>>> np.ones((N,2),dtype=np.float32, order='F').sum(axis=0)
array([17000000., 17000000.], dtype=float32)
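The flags confirm the layout difference: with order='F' the elements along axis 0 are contiguous in memory, so the summation runs over adjacent elements:
>>> np.ones((N,2), dtype=np.float32, order='F').flags['F_CONTIGUOUS']
True
>>> np.ones((N,2), dtype=np.float32, order='C').flags['F_CONTIGUOUS']
False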