How to optimize a nested for loop in Python

轻奢々 2021-02-08 10:34

So I am trying to write a Python function to return a metric called the Mielke-Berry R value. The metric is calculated like so:

R = 1 − (n² · MAE) / (Σᵢ Σⱼ |Fⱼ − Oᵢ|)

where n is the number of observations, MAE is the mean absolute error, and F and O are the forecasted and observed values.

The current code I have written works, but it is very slow because it indexes the numpy arrays inside a nested Python for loop.
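
Here is (roughly) what I currently have, where mae is my mean absolute error helper:

    import numpy as np

    def mb_r(forecasted_array, observed_array):
        """Returns the Mielke-Berry R value (slow nested-loop version)."""
        assert len(observed_array) == len(forecasted_array)
        size = len(forecasted_array)
        total = 0.
        for i in range(size):
            for j in range(size):  # O(n^2) Python-level iterations
                total += abs(forecasted_array[j] - observed_array[i])
        # mae() is my mean-absolute-error helper, defined elsewhere
        return 1 - (mae(forecasted_array, observed_array) * size ** 2 / total)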

3 Answers
  • 2021-02-08 10:57

    Broadcasting in numpy

    If you are not memory-constrained, the first step in optimizing nested loops in numpy is to use broadcasting and perform the operations in a vectorized way:

    import numpy as np

    def mb_r(forecasted_array, observed_array):
        """Returns the Mielke-Berry R value."""
        assert len(observed_array) == len(forecasted_array)
        # Sum of all pairwise |F_i - O_j|, computed via broadcasting
        total = np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()  # scalar
        return 1 - (mae(forecasted_array, observed_array) * forecasted_array.size ** 2 / total)
    

    But while the looping now occurs in C instead of Python, this approach allocates a temporary array of shape (N, N).
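
    To get a feel for that overhead, here is a quick back-of-the-envelope check (illustrative numbers, not from the benchmarks below): for N = 10,000 float64 values the temporary alone takes about 0.8 GB:

    import numpy as np

    # Size of the (N, N) float64 temporary created by broadcasting
    N = 10_000
    temp_bytes = N * N * np.dtype(np.float64).itemsize
    print(f"{temp_bytes / 1e9:.1f} GB")  # -> 0.8 GB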

    Broadcasting is not a panacea: try unrolling the inner loop

    As noted above, broadcasting implies a huge memory overhead, so it should be used with care; it is not always the right way. Your first impression may be to use it everywhere, but do not. Not so long ago I was also confused by this fact, see my question Numpy ufuncs speed vs for loop speed. To avoid being too verbose, I will show this on your example:

    import numpy as np
    
    # Broadcast version: allocates an (N, N) temporary array
    def mb_r_bcast(forecasted_array, observed_array):
        return np.abs(forecasted_array[:, np.newaxis] - observed_array).sum()

    # Inner-loop-unrolled version: only an (N,) temporary per iteration
    def mb_r_unroll(forecasted_array, observed_array):
        size = len(observed_array)
        total = 0.
        for i in range(size):  # There is only one Python-level loop
            total += np.abs(forecasted_array - observed_array[i]).sum()
        return total
    

    Small-size arrays (broadcasting is faster)

    forecasted = np.random.rand(100)
    observed = np.random.rand(100)
    
    %timeit mb_r_bcast(forecasted, observed)
    57.5 µs ± 359 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    %timeit mb_r_unroll(forecasted, observed)
    1.17 ms ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    Medium-size arrays (equal)

    forecasted = np.random.rand(1000)
    observed = np.random.rand(1000)
    
    %timeit mb_r_bcast(forecasted, observed)
    15.6 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    %timeit mb_r_unroll(forecasted, observed)
    16.4 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Large-size arrays (broadcasting is slower)

    forecasted = np.random.rand(10000)
    observed = np.random.rand(10000)
    
    %timeit mb_r_bcast(forecasted, observed)
    1.51 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    %timeit mb_r_unroll(forecasted, observed)
    377 ms ± 994 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    As you can see, for small arrays the broadcast version is ~20x faster than the unrolled one, for medium arrays they are roughly equal, but for large arrays it is ~4x slower because the memory overhead exacts its price.
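
    If you need a middle ground, one option (my own sketch, with the chunk size as a tunable assumption) is to broadcast in fixed-size chunks, so that the temporary is (chunk, N) instead of (N, N):

    # Chunked hybrid: vectorized per block, memory capped at (chunk, N)
    def mb_r_chunked(forecasted_array, observed_array, chunk=256):
        total = 0.
        for start in range(0, len(forecasted_array), chunk):
            block = forecasted_array[start:start + chunk, np.newaxis]
            total += np.abs(block - observed_array).sum()
        return total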

    Numba jit and parallelization

    Another approach is to use numba and its powerful @jit function decorator. In this case, only a slight modification of your initial code is necessary: to make the loops parallel, change range to prange and provide the parallel=True keyword argument. In the snippet below I use the @njit decorator, which is the same as @jit(nopython=True):

    from numba import njit, prange

    @njit(parallel=True)
    def mb_r_njit(forecasted_array, observed_array):
        """Returns the Mielke-Berry R value."""
        assert len(observed_array) == len(forecasted_array)
        total = 0.  # numba treats this accumulation as a parallel reduction
        size = len(forecasted_array)
        for i in prange(size):
            observed = observed_array[i]
            for j in prange(size):  # only the outermost prange is parallelized
                total += abs(forecasted_array[j] - observed)
        return 1 - (mae(forecasted_array, observed_array) * size ** 2 / total)
    

    You didn't provide the mae function, but to run the code in njit mode you must also decorate the mae function with @njit, or, if it is just a number, pass it as an argument to the jitted function.
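
    For illustration, a minimal jitted mae could look like this (the name and formula are assumptions, since the original helper is not shown):

    import numpy as np
    from numba import njit

    @njit
    def mae(forecasted_array, observed_array):
        # Hypothetical mean-absolute-error helper, jitted so that
        # mb_r_njit can call it in nopython mode
        return np.abs(forecasted_array - observed_array).mean()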

    Other options

    The Python scientific ecosystem is huge; I will just mention some other options to speed things up: Cython, Nuitka, Pythran, bottleneck, and many others. Perhaps you are also interested in GPU computing, but that is really another story.

    Timings

    On my computer, unfortunately an old one, the timings are:

    import numpy as np
    import numexpr as ne
    
    forecasted_array = np.random.rand(10000)
    observed_array   = np.random.rand(10000)
    

    initial version

    %timeit mb_r(forecasted_array, observed_array)
    23.4 s ± 430 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    numexpr

    %%timeit
    forecasted_array2d = forecasted_array[:, np.newaxis]
    ne.evaluate('sum(abs(forecasted_array2d - observed_array))')[()]
    784 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    broadcast version

    %timeit mb_r_bcast(forecasted_array, observed_array)
    1.47 s ± 4.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    inner loop unrolled version

    %timeit mb_r_unroll(forecasted_array, observed_array)
    389 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    numba njit(parallel=True) version

    %timeit mb_r_njit(forecasted_array, observed_array)
    32 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    It can be seen that the njit approach is ~730x faster than your initial solution, and also ~24.5x faster than the numexpr solution (maybe you need Intel's Vector Math Library to accelerate it). The simple approach of unrolling the inner loop also gives you a ~60x speedup compared to your initial version. My specs are:

    Intel(R) Core(TM)2 Quad CPU Q9550 2.83GHz
    Python 3.6.3
    numpy 1.13.3
    numba 0.36.1
    numexpr 2.6.4

    Final Note

    I was surprised by your phrase "I have heard (haven't yet tested) that indexing a numpy array using a python for loop is very slow." So I tested it:

    arr = np.arange(1000)
    ls = arr.tolist()
    
    %timeit for i in arr: pass
    69.5 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    %timeit for i in ls: pass
    13.3 µs ± 81.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    
    %timeit for i in range(len(arr)): arr[i]
    167 µs ± 997 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    %timeit for i in range(len(ls)): ls[i]
    90.8 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    and it turns out that you are right: it is 2-5x faster to iterate over the list. Iterating over a numpy array has to create a new numpy scalar object for each element, while a list already holds ready-made Python objects. Of course, these results must be taken with a grain of salt :)

  • 2021-02-08 10:58

    Here's one vectorized way to leverage broadcasting to get total -

    np.abs(forecasted_array[:,None] - observed_array).sum()
    

    To accept both lists and arrays alike, we can use the NumPy builtin for the outer subtraction, like so -

    np.abs(np.subtract.outer(forecasted_array, observed_array)).sum()
    
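    A quick check with plain lists (an illustrative example):

    import numpy as np

    # np.subtract.outer converts its inputs to ndarrays internally,
    # so plain lists work as well as arrays
    total = np.abs(np.subtract.outer([1.0, 2.0, 3.0], [0.5, 2.5])).sum()
    print(total)  # 7.0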

    We can also make use of the numexpr module for faster absolute-value computation and perform the summation-reduction in one single numexpr evaluate call, which is much more memory efficient, like so -

    import numexpr as ne
    
    forecasted_array2D = forecasted_array[:,None]
    total = ne.evaluate('sum(abs(forecasted_array2D - observed_array))')
    
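    Plugging that total back into the question's formula would then look like this (mae being the question's helper, not shown here):

    # Hypothetical final step: total from ne.evaluate is a 0-d array,
    # which behaves like a scalar in arithmetic
    n = forecasted_array.size
    r = 1 - mae(forecasted_array, observed_array) * n ** 2 / total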
  • 2021-02-08 11:19

    As a reference, the following code:

    #pythran export mb_r(float64[], float64[])
    import numpy as np
    
    def mb_r(forecasted_array, observed_array):
        return np.abs(forecasted_array[:,None] - observed_array).sum()
    

    Runs at the following speed on pure CPython:

    % python -m perf timeit -s 'import numpy as np; x = np.random.rand(400); y = np.random.rand(400); from mbr import mb_r' 'mb_r(x, y)' 
    .....................
    Mean +- std dev: 730 us +- 35 us
    

    And when compiled with Pythran I get

    % pythran -march=native -DUSE_BOOST_SIMD mbr.py
    % python -m perf timeit -s 'import numpy as np; x = np.random.rand(400); y = np.random.rand(400); from mbr import mb_r' 'mb_r(x, y)'
    .....................
    Mean +- std dev: 65.8 us +- 1.7 us
    

    So roughly a 10x speedup, on a single core with the AVX extension.
