Suppose we have N numbers (integers, floats, whatever you want) and want to find their arithmetic mean. The simplest method is to sum all the values and divide by the number of values:

mean = (a[1] + a[2] + ... + a[N]) / N
Here's a way to calculate the mean using only integers, with no rounding errors and no big intermediate values:
def integer_mean(numbers):
    N = len(numbers)
    total = 0              # whole part of the running mean
    rest = 0               # accumulated remainders, reduced modulo N at the end
    for num in numbers:
        total += num // N  # each value contributes its quotient...
        rest += num % N    # ...and its remainder, kept separately
    total += rest // N     # fold the accumulated remainders back in
    rest = rest % N
    return total, rest     # the exact mean is total + rest/N
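For example, with numbers = [1, 2, 4] (so N = 3) the loop ends with total = 1 and rest = 4, and the final fold-in yields (2, 1): the exact mean 2 + 1/3 = 7/3.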
If you use floats, you can avoid big integers:
def simple_mean(array):
    total = 0.0            # <-- float accumulator, so no big integers
    for x in array:
        total += x
    return total / len(array)
The Kahan summation algorithm (according to Wikipedia) runs in O(n) time, like pairwise summation, but with only O(1) error growth, versus O(log n) for pairwise:
function KahanSum(input)
var sum = 0.0
var c = 0.0 // A running compensation for lost low-order bits.
for i = 1 to input.length do
var y = input[i] - c // So far, so good: c is zero.
var t = sum + y // Alas, sum is big, y small, so low-order digits of y are lost.
c = (t - sum) - y // (t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
sum = t // Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!
// Next time around, the lost low part will be added to y in a fresh attempt.
return sum
The idea is that the low-order bits lost in each floating-point addition are captured in a separate compensation term and fed back into the main summation.
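To see the compensation at work, here's a small self-contained C sketch (my own illustration; naive_sum and kahan_sum are hypothetical names) comparing both sums over a million copies of 0.1, which is not exactly representable in binary:

#include <stdio.h>

/* Naive left-to-right summation: worst-case error grows with n. */
static double naive_sum(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

/* Kahan compensated summation, following the pseudocode above. */
static double kahan_sum(const double *a, int n) {
    double sum = 0.0, c = 0.0;      /* c compensates for lost low-order bits */
    for (int i = 0; i < n; ++i) {
        double y = a[i] - c;
        double t = sum + y;         /* low-order digits of y may be lost here */
        c = (t - sum) - y;          /* recover the lost part */
        sum = t;
    }
    return sum;
}

int main(void) {
    enum { N = 1000000 };
    static double a[N];
    for (int i = 0; i < N; ++i)
        a[i] = 0.1;
    printf("naive mean: %.17g\n", naive_sum(a, N) / N);
    printf("kahan mean: %.17g\n", kahan_sum(a, N) / N);
    return 0;
}

Compile it without aggressive floating-point optimizations (e.g. no -ffast-math), or the compiler may simplify the compensation away.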
If big integers are the problem... is it OK to compute a/N + b/N + ... + n/N? I mean, are you looking just for other ways, or for the optimal way?
If the array is floating-point data, even the "simple" algorithm suffers from rounding error. Interestingly, in that case, blocking the computation into sqrt(N) sums of length sqrt(N) actually reduces the error in the average case (even though the same number of floating-point roundings are performed).
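Here is a minimal sketch of that blocking scheme (my own illustration, not code from the answer; blocked_mean is a hypothetical name):

#include <math.h>

/* Sum in blocks of about sqrt(n). The same number of roundings occur,
 * but each partial sum stays small, so the accumulated error grows
 * roughly like sqrt(n) rather than n in the worst case. */
static double blocked_mean(const double *a, int n) {
    if (n <= 0)
        return 0.0;
    int b = (int)sqrt((double)n);   /* block length */
    if (b < 1)
        b = 1;
    double total = 0.0;
    for (int i = 0; i < n; i += b) {
        int end = (i + b < n) ? i + b : n;
        double block = 0.0;
        for (int j = i; j < end; ++j)
            block += a[j];          /* sum one block */
        total += block;             /* then combine the ~sqrt(n) block sums */
    }
    return total / n;
}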
For integer data, note that you don't need general "big integers"; if you have fewer than 4 billion elements in your array (likely), you only need an integer type 32 bits larger than the type of the array data. Performing addition in this slightly larger type will pretty much always be faster than doing division or modulus in the original type. For example, on most 32-bit systems, 64-bit addition is faster than 32-bit division/modulus. This effect only becomes more pronounced as the types get larger.
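As a sketch under that assumption (fewer than 2^32 elements; int32_mean and its signature are my own invention):

#include <stdint.h>

/* With n < 2^32 elements of a 32-bit type, a 64-bit accumulator cannot
 * overflow: n * 2^31 < 2^63. One division at the end replaces a
 * division and modulus per element. Assumes n > 0. */
static int64_t int32_mean(const int32_t *a, uint32_t n)
{
    int64_t sum = 0;
    for (uint32_t i = 0; i < n; ++i)
        sum += a[i];               /* cheap 64-bit additions */
    return sum / (int64_t)n;       /* single (truncating) division */
}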
Knuth gives the following method for calculating the mean and standard deviation of floating-point data (the original is on p. 232 of Vol. 2 of The Art of Computer Programming, 1998 edition; my adaptation below avoids special-casing the first iteration):
double M=0, S=0;
for (int i = 0; i < N; ++i)
{
double Mprev = M;
M += (x[i] - M)/(i+1);
S += (x[i] - M)*(x[i] - Mprev);
}
// mean = M
// std dev = sqrt(S/N) or sqrt(S/(N-1)),
// depending on whether you want the population or sample std dev
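Wrapped into a self-contained function (my own packaging; welford_stats is a hypothetical name, not from the answer):

#include <math.h>
#include <stddef.h>

/* One-pass mean and population standard deviation (Knuth/Welford). */
static void welford_stats(const double *x, size_t n,
                          double *mean, double *stddev)
{
    double M = 0, S = 0;
    for (size_t i = 0; i < n; ++i) {
        double Mprev = M;
        M += (x[i] - M) / (double)(i + 1);
        S += (x[i] - M) * (x[i] - Mprev);
    }
    *mean = M;
    *stddev = n ? sqrt(S / (double)n) : 0.0;  /* use n - 1 for the sample std dev */
}

Because the update keeps M near the running mean rather than accumulating a raw sum, this also sidesteps the large intermediate values discussed above.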