Most efficient way to map function over numpy array

庸人自扰 2020-11-22 02:13

What is the most efficient way to map a function over a numpy array? The way I've been doing it in my current project is as follows:

import numpy as np 

x          


        
11 Answers
  • 2020-11-22 02:34

    TL;DR

    As noted by @user2357112, when the function is built from operations that NumPy already applies elementwise, a "direct" method of applying it is the fastest and simplest way to map a function over NumPy arrays:

    import numpy as np
    x = np.array([1, 2, 3, 4, 5])
    f = lambda x: x ** 2
    squares = f(x)
    

    Generally avoid np.vectorize, as it does not perform well, and has (or had) a number of issues. If you are handling other data types, you may want to investigate the other methods shown below.

    Comparison of methods

    Here are some simple tests to compare four methods of mapping a function, in this example using Python 3.6 and NumPy 1.15.4. First, the set-up functions for testing:

    import timeit
    import numpy as np
    
    f = lambda x: x ** 2
    vf = np.vectorize(f)
    
    def test_array(x, n):
        t = timeit.timeit(
            'np.array([f(xi) for xi in x])',
            'from __main__ import np, x, f', number=n)
        print('array: {0:.3f}'.format(t))
    
    def test_fromiter(x, n):
        t = timeit.timeit(
            'np.fromiter((f(xi) for xi in x), x.dtype, count=len(x))',
            'from __main__ import np, x, f', number=n)
        print('fromiter: {0:.3f}'.format(t))
    
    def test_direct(x, n):
        t = timeit.timeit(
            'f(x)',
            'from __main__ import x, f', number=n)
        print('direct: {0:.3f}'.format(t))
    
    def test_vectorized(x, n):
        t = timeit.timeit(
            'vf(x)',
            'from __main__ import x, vf', number=n)
        print('vectorized: {0:.3f}'.format(t))
    

    Testing with five elements (sorted from fastest to slowest):

    x = np.array([1, 2, 3, 4, 5])
    n = 100000
    test_direct(x, n)      # 0.265
    test_fromiter(x, n)    # 0.479
    test_array(x, n)       # 0.865
    test_vectorized(x, n)  # 2.906
    

    With 100s of elements:

    x = np.arange(100)
    n = 10000
    test_direct(x, n)      # 0.030
    test_array(x, n)       # 0.501
    test_vectorized(x, n)  # 0.670
    test_fromiter(x, n)    # 0.883
    

    And with 1000s of array elements or more:

    x = np.arange(1000)
    n = 1000
    test_direct(x, n)      # 0.007
    test_fromiter(x, n)    # 0.479
    test_array(x, n)       # 0.516
    test_vectorized(x, n)  # 0.945
    

    Different versions of Python/NumPy and compiler optimization will have different results, so do a similar test for your environment.
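
    A minimal sketch of such a test, assuming Python 3.5 or later so that timeit accepts a globals argument. This avoids the 'from __main__ import ...' setup strings used above, which only work when the script runs as the main module. The helper name bench and the chosen sizes are illustrative and not part of the original answer.

```python
import timeit
import numpy as np

f = lambda v: v ** 2
vf = np.vectorize(f)

# Statements to compare; each is evaluated with np, x, f and vf in scope.
STATEMENTS = {
    "direct": "f(x)",
    "array": "np.array([f(xi) for xi in x])",
    "fromiter": "np.fromiter((f(xi) for xi in x), x.dtype, count=len(x))",
    "vectorized": "vf(x)",
}


def bench(x, number=100, repeat=3):
    """Return {label: best time in seconds over `repeat` runs of `number` calls}."""
    namespace = {"np": np, "x": x, "f": f, "vf": vf}
    return {
        label: min(timeit.repeat(stmt, globals=namespace,
                                 number=number, repeat=repeat))
        for label, stmt in STATEMENTS.items()
    }


if __name__ == "__main__":
    for size in (5, 100, 1000):
        results = bench(np.arange(size))
        # Print fastest first; absolute numbers depend on the machine.
        for label, seconds in sorted(results.items(), key=lambda kv: kv[1]):
            print("n={0:5d} {1:<10s} {2:.4f}".format(size, label, seconds))
```

    Taking the minimum over several repeats reduces the influence of other processes on the measurement.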

  • 2020-11-22 02:34

    I believe that in newer versions of NumPy (I use 1.13) you can simply pass the array to a function that you wrote for a scalar. As long as the function only uses operations that NumPy arrays support elementwise (such as ** here), the operation is applied to each element and another NumPy array is returned:

    >>> import numpy as np
    >>> squarer = lambda t: t ** 2
    >>> x = np.array([1, 2, 3, 4, 5])
    >>> squarer(x)
    array([ 1,  4,  9, 16, 25])
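
    This works because ** is implemented elementwise by the array itself, not because NumPy loops over a scalar function. A short sketch of where the approach stops working, using two hypothetical scalar-only functions (scalar_sqrt and scalar_clip are made up for illustration), and the NumPy calls that can replace them:

```python
import math
import numpy as np

x = np.array([1.0, 4.0, 9.0])

# Works: ** is implemented elementwise by ndarray, so the "scalar" function
# is really a single vectorized NumPy operation.
squarer = lambda t: t ** 2
squares = squarer(x)


# Fails: math.sqrt only accepts a single number.
def scalar_sqrt(t):
    return math.sqrt(t)


try:
    scalar_sqrt(x)
    sqrt_failed = False
except TypeError:
    sqrt_failed = True


# Fails: `if` needs one truth value, but t > 2 is an array of booleans.
def scalar_clip(t):
    return t if t > 2 else 0.0


try:
    scalar_clip(x)
    clip_failed = False
except ValueError:
    clip_failed = True

# NumPy equivalents that operate on the whole array at once.
roots = np.sqrt(x)
clipped = np.where(x > 2, x, 0.0)
```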
    
  • 2020-11-22 02:35

    Edit: the original answer was misleading, np.sqrt was applied directly to the array, just with a small overhead.

    In multidimensional cases where you want to apply a builtin function that operates on a 1d array, numpy.apply_along_axis is a good choice, also for more complex function compositions from numpy and scipy.
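
    A minimal sketch of that use case, assuming a hypothetical 1-D function peak_to_peak and a small made-up 2-D array. Note that np.apply_along_axis still calls the Python function once per 1-D slice, so it is a convenience rather than a speed-up.

```python
import numpy as np


def peak_to_peak(values):
    """Hypothetical 1-D function: takes a 1-D array, returns a scalar."""
    return values.max() - values.min()


data = np.array([[1.0, 5.0, 2.0],
                 [4.0, 4.0, 10.0]])

# Apply the 1-D function to every row (slices taken along axis 1) ...
per_row = np.apply_along_axis(peak_to_peak, 1, data)
# ... or to every column (slices taken along axis 0).
per_col = np.apply_along_axis(peak_to_peak, 0, data)

print(per_row)  # one value per row
print(per_col)  # one value per column
```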

    Previous misleading statement:

    Adding the method:

    def along_axis(x):
        return np.apply_along_axis(f, 0, x)
    

    to the perfplot code gives performance results close to np.sqrt.

  • 2020-11-22 02:36

    numexpr, numba and Cython are also available; the goal of this answer is to take these possibilities into consideration.

    But first let's state the obvious: no matter how you map a Python function onto a NumPy array, it stays a Python function. That means, for every evaluation:

    • each NumPy array element must be converted to a Python object (e.g. a float).
    • all calculations are done with Python objects, which means incurring the overhead of the interpreter, dynamic dispatch and immutable objects.

    So which machinery is used to actually loop through the array doesn't play a big role because of the overhead mentioned above: it stays much slower than using NumPy's built-in functionality.
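
    The per-element overhead can be made visible by counting calls. The following is only a sketch: the counter dictionary and the instrumented function are illustrative additions, and the polynomial is the one used as the example in this answer.

```python
import numpy as np

counter = {"calls": 0}


def f(v):
    """Example polynomial, instrumented to count Python-level calls."""
    counter["calls"] += 1
    return v + 2 * v * v + 4 * v * v * v


x = np.linspace(0.0, 1.0, 5)

# otypes is given so that np.vectorize does not make an extra trial call
# to infer the output type.
vf = np.vectorize(f, otypes=[float])
slow = vf(x)
calls_vectorized = counter["calls"]  # one Python-level call per element

counter["calls"] = 0
fast = f(x)                          # one call; the loop runs inside NumPy
calls_direct = counter["calls"]

print(calls_vectorized, calls_direct)
```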

    Let's take a look at the following example:

    # numpy-functionality
    def f(x):
        return x+2*x*x+4*x*x*x
    
    # python-function as ufunc
    import numpy as np
    vf=np.vectorize(f)
    vf.__name__="vf"
    

    np.vectorize is picked as a representative of the pure-Python class of approaches. Using perfplot (see the code in the appendix of this answer) we get the following running times:

    We can see that the NumPy approach is 10x-100x faster than the pure-Python version. The drop in performance for bigger array sizes is probably because the data no longer fits in the cache.

    It is also worth mentioning that vectorize uses a lot of memory, so memory usage is often the bottleneck (see related SO-question). Also note that NumPy's documentation on np.vectorize states that it is "provided primarily for convenience, not for performance".

    Other tools should be used when performance is desired. Besides writing a C extension from scratch, there are the following possibilities:


    One often hears that NumPy performance is as good as it gets, because it is pure C under the hood. Yet there is a lot of room for improvement!

    The vectorized NumPy version uses a lot of additional memory and memory accesses. The numexpr library tries to tile the NumPy arrays and thus get better cache utilization:

    # less cache misses than numpy-functionality
    import numexpr as ne
    def ne_f(x):
        return ne.evaluate("x+2*x*x+4*x*x*x")
    

    Leads to the following comparison:

    I cannot explain everything in the plot above: we can see a bigger overhead for the numexpr library at the beginning, but because it utilizes the cache better it is about 10 times faster for bigger arrays!
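
    Much of the additional memory traffic in the plain NumPy version comes from temporaries: each operator in x+2*x*x+4*x*x*x allocates a new intermediate array. The sketch below is plain NumPy, not numexpr (which in addition processes the array in cache-sized chunks). It evaluates the same polynomial in Horner form with in-place operations, so only one output buffer is allocated. The name horner_inplace is hypothetical, and no timing is claimed here; measure on your own machine.

```python
import numpy as np


def f(x):
    # Straightforward version: every operator allocates a temporary array.
    return x + 2 * x * x + 4 * x * x * x


def horner_inplace(x):
    """x + 2x^2 + 4x^3 rewritten as x * (1 + x * (2 + 4x)), one buffer only."""
    y = 4.0 * x   # the only allocation: y is a new array, x is untouched
    y += 2.0      # 2 + 4x
    y *= x        # x * (2 + 4x)
    y += 1.0      # 1 + x * (2 + 4x)
    y *= x        # x * (1 + x * (2 + 4x))
    return y


x = np.random.rand(1000)
x_before = x.copy()
result = horner_inplace(x)
```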


    Another approach is to JIT-compile the function and thus get a real native-code ufunc. This is numba's approach:

    # runtime generated C-function as ufunc
    import numba as nb
    @nb.vectorize(target="cpu")
    def nb_vf(x):
        return x+2*x*x+4*x*x*x
    

    It is 10 times faster than the original numpy-approach:


    However, the task is embarrassingly parallel, so we could also use prange to run the loop in parallel:

    @nb.njit(parallel=True)
    def nb_par_jitf(x):
        y=np.empty(x.shape)
        for i in nb.prange(len(x)):
            y[i]=x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
        return y
    

    As expected, the parallel function is slower for smaller inputs, but faster (by almost a factor of 2) for larger sizes:


    While numba specializes in optimizing operations on NumPy arrays, Cython is a more general tool. It is more complicated to extract the same performance as with numba; often it comes down to LLVM (numba) vs. the local compiler (gcc/MSVC):

    %%cython -c=/openmp -a
    import numpy as np
    import cython
    
    #single core:
    @cython.boundscheck(False) 
    @cython.wraparound(False) 
    def cy_f(double[::1] x):
        y_out=np.empty(len(x))
        cdef Py_ssize_t i
        cdef double[::1] y=y_out
        for i in range(len(x)):
            y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
        return y_out
    
    #parallel:
    from cython.parallel import prange
    @cython.boundscheck(False) 
    @cython.wraparound(False)  
    def cy_par_f(double[::1] x):
        y_out=np.empty(len(x))
        cdef double[::1] y=y_out
        cdef Py_ssize_t i
        cdef Py_ssize_t n = len(x)
        for i in prange(n, nogil=True):
            y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
        return y_out
    

    Cython results in somewhat slower functions:


    Conclusion

    Obviously, testing only one function doesn't prove anything. One should also keep in mind that for the chosen example function, memory bandwidth was the bottleneck for sizes larger than 10^5 elements; thus we saw the same performance for numba, numexpr and Cython in this region.

    In the end, the ultimate answer depends on the type of function, hardware, Python distribution and other factors. For example, the Anaconda distribution uses Intel's VML for NumPy's functions and thus easily outperforms numba (unless it uses SVML, see this SO-post) for transcendental functions like exp, sin, cos and similar - see e.g. the following SO-post.

    Yet from this investigation and from my experience so far, I would say that numba seems to be the easiest tool with the best performance, as long as no transcendental functions are involved.


    Plotting running times with perfplot-package:

    import perfplot
    perfplot.show(
        setup=lambda n: np.random.rand(n),
        n_range=[2**k for k in range(0,24)],
        kernels=[
            f, 
            vf,
            ne_f, 
            nb_vf, nb_par_jitf,
            cy_f, cy_par_f,
            ],
        logx=True,
        logy=True,
        xlabel='len(x)'
        )
    
  • 2020-11-22 02:37

    How about using numpy.vectorize?

    import numpy as np
    x = np.array([1, 2, 3, 4, 5])
    squarer = lambda t: t ** 2
    vfunc = np.vectorize(squarer)
    vfunc(x)
    # Output : array([ 1,  4,  9, 16, 25])
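
    One thing to watch for with this approach: unless otypes is given, np.vectorize infers the output dtype by calling the function on the first element. A minimal sketch of that behaviour, assuming a hypothetical scalar function ramp that returns an int for small inputs and a float otherwise:

```python
import numpy as np


def ramp(v):
    """Hypothetical scalar function: int result for small inputs, float otherwise."""
    return 0 if v < 1 else 0.5 * v


x = np.array([0.0, 1.5, 3.0])

# The output dtype is inferred from ramp(x[0]), which is the Python int 0,
# so the remaining float results are cast to an integer dtype.
naive = np.vectorize(ramp)(x)

# Stating the output type explicitly avoids the inference.
typed = np.vectorize(ramp, otypes=[float])(x)

print(naive.dtype, typed.dtype)
```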
    
  • 2020-11-22 02:38

    I've tested all suggested methods plus np.array(list(map(f, x))) with perfplot (a small project of mine).

    Message #1: If you can use numpy's native functions, do that.

    If the function you're trying to vectorize already is vectorized (like the x**2 example in the original post), using that is much faster than anything else (note the log scale):

    If you actually need vectorization, it doesn't really matter much which variant you use.


    Code to reproduce the plots:

    import numpy as np
    import perfplot
    import math
    
    
    def f(x):
        # return math.sqrt(x)
        return np.sqrt(x)
    
    
    vf = np.vectorize(f)
    
    
    def array_for(x):
        return np.array([f(xi) for xi in x])
    
    
    def array_map(x):
        return np.array(list(map(f, x)))
    
    
    def fromiter(x):
        return np.fromiter((f(xi) for xi in x), x.dtype)
    
    
    def vectorize(x):
        return np.vectorize(f)(x)
    
    
    def vectorize_without_init(x):
        return vf(x)
    
    
    perfplot.show(
        setup=lambda n: np.random.rand(n),
        n_range=[2 ** k for k in range(20)],
        kernels=[f, array_for, array_map, fromiter, vectorize, vectorize_without_init],
        xlabel="len(x)",
    )
    