Most efficient way to map function over numpy array

庸人自扰 2020-11-22 02:13

What is the most efficient way to map a function over a numpy array? The way I've been doing it in my current project is as follows:

import numpy as np 

x          


        
11 answers
  •  孤街浪徒
    2020-11-22 02:36

    There are numexpr, numba and cython around; the goal of this answer is to take these possibilities into consideration.

    But first, let's state the obvious: no matter how you map a Python function onto a numpy array, it stays a Python function. That means that for every evaluation:

    • the numpy array element must be converted to a Python object (e.g. a float).
    • all calculations are done with Python objects, which brings the overhead of the interpreter, dynamic dispatch and immutable objects.

    So which machinery is used to actually loop through the array doesn't play a big role: because of the overhead mentioned above, it stays much slower than using numpy's built-in functionality.

    Let's take a look at the following example:

    # numpy-functionality
    def f(x):
        return x+2*x*x+4*x*x*x
    
    # python-function as ufunc
    import numpy as np
    vf=np.vectorize(f)
    vf.__name__="vf"
    

    np.vectorize is picked as a representative of the pure-python function class of approaches. Using perfplot (see code in the appendix of this answer) we get the following running times:

    We can see that the numpy approach is 10x-100x faster than the pure-Python version. The decrease in performance for bigger array sizes is probably because the data no longer fits in the cache.

    It is also worth mentioning that vectorize uses a lot of memory, so memory usage is often the bottleneck (see the related SO question). Also note that numpy's documentation on np.vectorize states that it is "provided primarily for convenience, not for performance".
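    The gap between the two approaches can be checked without perfplot. The following is a minimal sketch using only numpy and the standard-library timeit module; the helper list_comp and the array size of 10,000 are illustrative choices, not part of the original benchmark, and the measured times will vary by machine.

```python
import timeit
import numpy as np

def f(x):
    # Works on scalars and on whole arrays alike
    return x + 2*x*x + 4*x*x*x

# np.vectorize applies f element by element in Python
vf = np.vectorize(f)

def list_comp(a):
    # Explicit Python loop over the elements (illustrative helper)
    return np.array([f(v) for v in a])

x = np.random.rand(10_000)

# All three variants compute the same values
assert np.allclose(f(x), vf(x))
assert np.allclose(f(x), list_comp(x))

for name, fn in [("numpy", f), ("np.vectorize", vf), ("list comprehension", list_comp)]:
    # Best of 3 repeats, 10 calls each, reported per call
    t = min(timeit.repeat(lambda: fn(x), number=10, repeat=3)) / 10
    print(f"{name:20s} {t * 1e6:10.1f} us per call")
```

    Both per-element variants pay the Python call overhead for every element, so they should land in the same ballpark as each other and well behind the whole-array call.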

    Other tools should be used when performance is desired. Besides writing a C extension from scratch, there are the following possibilities:


    One often hears that numpy's performance is as good as it gets, because it is pure C under the hood. Yet there is a lot of room for improvement!

    The vectorized numpy version uses a lot of additional memory and memory accesses. The numexpr library tries to tile the numpy arrays and thus get better cache utilization:

    # fewer cache misses than numpy-functionality
    import numexpr as ne
    def ne_f(x):
        return ne.evaluate("x+2*x*x+4*x*x*x")
    

    This leads to the following comparison:

    I cannot explain everything in the plot above: we can see a bigger overhead for the numexpr library at the beginning, but because it utilizes the cache better it is about 10 times faster for bigger arrays!
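    The tiling idea can be imitated in plain numpy. The sketch below is only an illustration of the principle and not how numexpr is implemented; f_blocked and the block size of 4096 elements are hypothetical choices, it assumes a 1-D array, and whether it is actually faster depends on the machine.

```python
import numpy as np

def f(x):
    # Each arithmetic operator allocates a full-size temporary array
    return x + 2*x*x + 4*x*x*x

def f_blocked(x, block=4096):
    # Hypothetical helper: evaluate the same polynomial block by block, so
    # the temporaries are only `block` elements long and can stay in cache.
    out = np.empty_like(x)
    for start in range(0, len(x), block):
        xb = x[start:start + block]  # a view, no copy
        out[start:start + block] = xb + 2*xb*xb + 4*xb*xb*xb
    return out

x = np.random.rand(1_000_000)

# Same result as the straightforward whole-array evaluation
assert np.allclose(f(x), f_blocked(x))
```

    The Python-level loop runs once per block rather than once per element, so its overhead is spread over thousands of elements.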


    Another approach is to JIT-compile the function and thus get a real pure-C ufunc. This is numba's approach:

    # runtime generated C-function as ufunc
    import numba as nb
    @nb.vectorize(target="cpu")
    def nb_vf(x):
        return x+2*x*x+4*x*x*x
    

    It is 10 times faster than the original numpy-approach:


    However, the task is embarrassingly parallel, so we could also use prange to calculate the loop in parallel:

    @nb.njit(parallel=True)
    def nb_par_jitf(x):
        y=np.empty(x.shape)
        for i in nb.prange(len(x)):
            y[i]=x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
        return y
    

    As expected, the parallel function is slower for smaller inputs, but faster (by almost a factor of 2) for larger sizes:


    While numba specializes in optimizing operations with numpy arrays, Cython is a more general tool. It is more complicated to extract the same performance as with numba; often it comes down to LLVM (numba) vs. the local compiler (gcc/MSVC):

    %%cython -c=/openmp -a
    import numpy as np
    import cython
    
    #single core:
    @cython.boundscheck(False) 
    @cython.wraparound(False) 
    def cy_f(double[::1] x):
        y_out=np.empty(len(x))
        cdef Py_ssize_t i
        cdef double[::1] y=y_out
        for i in range(len(x)):
            y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
        return y_out
    
    #parallel:
    from cython.parallel import prange
    @cython.boundscheck(False) 
    @cython.wraparound(False)  
    def cy_par_f(double[::1] x):
        y_out=np.empty(len(x))
        cdef double[::1] y=y_out
        cdef Py_ssize_t i
        cdef Py_ssize_t n = len(x)
        for i in prange(n, nogil=True):
            y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
        return y_out
    

    Cython results in somewhat slower functions:


    Conclusion

    Obviously, testing only one function doesn't prove anything. One should also keep in mind that for the chosen example function, memory bandwidth was the bottleneck for sizes larger than 10^5 elements; thus we had the same performance for numba, numexpr and cython in this region.
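    One rough way to see whether a kernel is memory-bound is to convert its running time into an effective bandwidth. The sketch below assumes the minimum traffic of one read and one write per element; effective_bandwidth is a hypothetical helper, and the plain numpy version moves more data than this lower bound because of its temporaries.

```python
import timeit
import numpy as np

def f(x):
    return x + 2*x*x + 4*x*x*x

def effective_bandwidth(fn, n):
    # Hypothetical helper: GB/s, assuming n elements read and n written
    x = np.random.rand(n)
    seconds = min(timeit.repeat(lambda: fn(x), number=3, repeat=3)) / 3
    return 2 * n * x.itemsize / seconds / 1e9

for n in (10**4, 10**5, 10**6, 4 * 10**6):
    print(f"n={n:>8d}  {effective_bandwidth(f, n):6.2f} GB/s")
```

    If the figure stops growing, or drops, once the array no longer fits in cache, the function is likely limited by memory rather than by arithmetic, and different compiled implementations should converge to similar running times.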

    In the end, the ultimate answer depends on the type of function, hardware, Python distribution and other factors. For example, the Anaconda distribution uses Intel's VML for numpy's functions and thus easily outperforms numba (unless it uses SVML, see this SO post) for transcendental functions like exp, sin, cos and similar; see e.g. the following SO post.

    Yet from this investigation and from my experience so far, I would say that numba seems to be the easiest tool with the best performance, as long as no transcendental functions are involved.


    Plotting running times with perfplot-package:

    import perfplot
    perfplot.show(
        setup=lambda n: np.random.rand(n),
        n_range=[2**k for k in range(0,24)],
        kernels=[
            f, 
            vf,
            ne_f, 
            nb_vf, nb_par_jitf,
            cy_f, cy_par_f,
            ],
        logx=True,
        logy=True,
        xlabel='len(x)'
        )
    
