So I am trying to write a Python function that returns a metric called the Mielke-Berry R value. The metric is calculated like so:
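With n the common length of the two arrays, f the forecasted values, o the observed values, and MAE the mean absolute error between them, the value computed by the implementations below is:

$$R = 1 - \frac{n^2 \,\mathrm{MAE}(f, o)}{\sum_{i=1}^{n} \sum_{j=1}^{n} \left| f_j - o_i \right|}$$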
The current code I have written works, but it uses a nested Python for loop and is very slow for large arrays.
If you are not memory constrained, the first step to optimize nested loops in numpy is to use broadcasting and perform operations in a vectorized way:
import numpy as np

def mae(forecasted_array, observed_array):
    # Mean absolute error; the question's mae was not shown, this is the standard definition
    return np.mean(np.abs(forecasted_array - observed_array))

def mb_r(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    total = np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()  # Broadcasting
    return 1 - mae(forecasted_array, observed_array) * forecasted_array.size ** 2 / total
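For example, on random data (arbitrary arrays, just to show the call):

forecasted = np.random.rand(1000)
observed = np.random.rand(1000)
print(mb_r(forecasted, observed))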
But while the looping now occurs in C instead of Python, this approach allocates a temporary array of size (N, N): for N = 10,000 float64 values, that single temporary is 10000² × 8 bytes ≈ 800 MB.
As noted above, broadcasting implies a huge memory overhead, so it should be used with care; it is not always the right way. While your first impression may be to use it everywhere, do not. Not so long ago I was also confused by this fact; see my question Numpy ufuncs speed vs for loop speed. Not to be too verbose, I will show this on your example:
import numpy as np

# Broadcast version
def mb_r_bcast(forecasted_array, observed_array):
    return np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()

# Inner-loop-unrolled version
def mb_r_unroll(forecasted_array, observed_array):
    size = len(observed_array)
    total = 0.
    for i in range(size):  # There is only one Python-level loop
        total += np.abs(forecasted_array - observed_array[i]).sum()
    return total
Small-size arrays (broadcasting is faster)
forecasted = np.random.rand(100)
observed = np.random.rand(100)
%timeit mb_r_bcast(forecasted, observed)
57.5 µs ± 359 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit mb_r_unroll(forecasted, observed)
1.17 ms ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Medium-size arrays (equal)
forecasted = np.random.rand(1000)
observed = np.random.rand(1000)
%timeit mb_r_bcast(forecasted, observed)
15.6 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit mb_r_unroll(forecasted, observed)
16.4 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Large-size arrays (broadcasting is slower)
forecasted = np.random.rand(10000)
observed = np.random.rand(10000)
%timeit mb_r_bcast(forecasted, observed)
1.51 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit mb_r_unroll(forecasted, observed)
377 ms ± 994 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, for small arrays the broadcast version is 20x faster than the unrolled one, for medium-sized arrays they are roughly equal, but for large arrays it is 4x slower because the memory overhead is paying its own costly price.
Another approach is to use numba and its powerful @jit function decorator. In this case, only a slight modification of your initial code is necessary. Also, to make the loops parallel you should change range to prange and provide the parallel=True keyword argument. In the snippet below I use the @njit decorator, which is the same as @jit(nopython=True):
from numba import njit, prange

@njit(parallel=True)
def mb_r_njit(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    total = 0.
    size = len(forecasted_array)
    for i in prange(size):
        observed = observed_array[i]
        for j in prange(size):
            total += abs(forecasted_array[j] - observed)
    return 1 - (mae(forecasted_array, observed_array) * size ** 2 / total)
You didn't provide the mae function, but to run the code in njit mode you must also decorate the mae function, or, if it is a number, pass it as an argument to the jitted function.
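For completeness, a minimal njit-compatible mae might look like this (an assumption on my part, using the standard mean-absolute-error definition):

import numpy as np
from numba import njit

@njit
def mae(forecasted_array, observed_array):
    # Standard mean absolute error; decorated so the jitted mb_r_njit can call it
    return np.mean(np.abs(forecasted_array - observed_array))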
The Python scientific ecosystem is huge; I will just mention some other equivalent options to speed things up: Cython, Nuitka, Pythran, bottleneck, and many others. Perhaps you are interested in GPU computing, but that is actually another story.
On my computer (unfortunately an old one) the timings are:
import numpy as np
import numexpr as ne
forecasted_array = np.random.rand(10000)
observed_array = np.random.rand(10000)
initial version
%timeit mb_r(forecasted_array, observed_array)
23.4 s ± 430 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numexpr
%%timeit
forecasted_array2d = forecasted_array[:, np.newaxis]
ne.evaluate('sum(abs(forecasted_array2d - observed_array))')[()]
784 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
broadcast version
%timeit mb_r_bcast(forecasted, observed)
1.47 s ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
inner loop unrolled version
%timeit mb_r_unroll(forecasted, observed)
389 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numba njit(parallel=True) version
%timeit mb_r_njit(forecasted_array, observed_array)
32 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It can be seen that the njit approach is 730x faster than your initial solution, and also 24.5x faster than the numexpr solution (maybe you need Intel's Vector Math Library to accelerate it). Also, the simple approach with the inner loop unrolled gives you a 60x speed-up compared to your initial version. My specs are:
Intel(R) Core(TM)2 Quad CPU Q9550 2.83GHz
Python 3.6.3
numpy 1.13.3
numba 0.36.1
numexpr 2.6.4
I was surprised by your phrase "I have heard (haven't yet tested) that indexing a numpy array using a python for loop is very slow." So I tested it:
arr = np.arange(1000)
ls = arr.tolist()
%timeit for i in arr: pass
69.5 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit for i in ls: pass
13.3 µs ± 81.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit for i in range(len(arr)): arr[i]
167 µs ± 997 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit for i in range(len(ls)): ls[i]
90.8 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and it turns out that you are right. It is 2-5x faster to iterate over the list. Of course, these results must be taken with a certain amount of irony :)
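So if the loop really has to stay in Python, converting the array to a list once up front is a cheap win (a small sketch; iterating a NumPy array yields a NumPy scalar object per element, which is what makes it slow):

arr = np.arange(1000)   # as above
ls = arr.tolist()       # convert once up front
total = 0
for value in ls:        # then loop over plain Python ints
    total += value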
Here's one vectorized way to leverage broadcasting to get total -
np.abs(forecasted_array[:,None] - observed_array).sum()
To accept both lists and arrays alike, we can use the NumPy builtin for the outer subtraction, like so -
np.abs(np.subtract.outer(forecasted_array, observed_array)).sum()
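For instance, plain Python lists work directly, since np.subtract.outer converts array-like inputs (arbitrary data, just to illustrate):

import numpy as np

f_list = [0.1, 0.5, 0.9]  # a plain list, no np.asarray needed
o_list = [0.2, 0.4]
total = np.abs(np.subtract.outer(f_list, o_list)).sum()
print(total)  # sum of all pairwise absolute differences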
We can also make use of the numexpr module for faster absolute computations, and perform the summation-reduction in one single numexpr evaluate call, which as such is much more memory efficient, like so -
import numexpr as ne
forecasted_array2D = forecasted_array[:,None]
total = ne.evaluate('sum(abs(forecasted_array2D - observed_array))')
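Plugging that total back into the R formula would then look something like this (a sketch continuing the snippet above, and assuming the standard mean-absolute-error definition for mae):

import numpy as np  # forecasted_array, observed_array, total as above

n = forecasted_array.size
mae_val = np.mean(np.abs(forecasted_array - observed_array))
r = 1 - mae_val * n**2 / float(total)  # float() unwraps the 0-d array numexpr returns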
As a reference, the following code:
#pythran export mb_r(float64[], float64[])
import numpy as np

def mb_r(forecasted_array, observed_array):
    return np.abs(forecasted_array[:, None] - observed_array).sum()
Runs at the following speed on pure CPython:
% python -m perf timeit -s 'import numpy as np; x = np.random.rand(400); y = np.random.rand(400); from mbr import mb_r' 'mb_r(x, y)'
.....................
Mean +- std dev: 730 us +- 35 us
And when compiled with Pythran, I get:
% pythran -march=native -DUSE_BOOST_SIMD mbr.py
% python -m perf timeit -s 'import numpy as np; x = np.random.rand(400); y = np.random.rand(400); from mbr import mb_r' 'mb_r(x, y)'
.....................
Mean +- std dev: 65.8 us +- 1.7 us
So roughly a 10x speedup, on a single core with the AVX extension.