Most efficient way to map function over numpy array

前端未结

关注

 11  1330

庸人自扰 2020-11-22 02:13

What is the most efficient way to map a function over a numpy array? The way I\'ve been doing it in my current project is as follows:

import numpy as np 

x


      
      
        
          11条回答        

        
                    
            
            
                         
                
              
              
                
                   孤街浪徒
                                             
                
                
                (楼主)
            
              
              
                2020-11-22 02:36
              

            
            
                        
There are numexpr, numba and cython around, the goal of this answer is to take these possibilities into consideration.
But first let's state the obvious: no matter how you map a Python-function onto a numpy-array, it stays a Python function, that means for every evaluation:

numpy-array element must be converted to a Python-object (e.g. a Float).
all calculations are done with Python-objects, which means to have the overhead of interpreter, dynamic dispatch and immutable objects.

So which machinery is used to actually loop through the array doesn't play a big role because of the overhead mentioned above - it stays much slower than using numpy's built-in functionality.
Let's take a look at the following example:
# numpy-functionality
def f(x):
    return x+2*x*x+4*x*x*x

# python-function as ufunc
import numpy as np
vf=np.vectorize(f)
vf.__name__="vf"

np.vectorize is picked as a representative of the pure-python function class of approaches. Using perfplot (see code in the appendix of this answer) we get the following running times:

We can see, that the numpy-approach is 10x-100x faster than the pure python version. The decrease of performance for bigger array-sizes is probably because data no longer fits the cache.
It is worth also mentioning, that vectorize also uses a lot of memory, so often memory-usage is the bottle-neck (see related SO-question). Also note, that numpy's documentation on np.vectorize states that it is "provided primarily for convenience, not for performance".
Other tools should be used, when performance is desired, beside writing a C-extension from the scratch, there are following possibilities:

One often hears, that the numpy-performance is as good as it gets, because it is pure C under the hood. Yet there is a lot room for improvement!
The vectorized numpy-version uses a lot of additional memory and memory-accesses. Numexp-library tries to tile the numpy-arrays and thus get a better cache utilization:
# less cache misses than numpy-functionality
import numexpr as ne
def ne_f(x):
    return ne.evaluate("x+2*x*x+4*x*x*x")

Leads to the following comparison:

I cannot explain everything in the plot above: we can see bigger overhead for numexpr-library at the beginning, but because it utilize the cache better it is about 10 time faster for bigger arrays!

Another approach is to jit-compile the function and thus getting a real pure-C UFunc. This is numba's approach:
# runtime generated C-function as ufunc
import numba as nb
@nb.vectorize(target="cpu")
def nb_vf(x):
    return x+2*x*x+4*x*x*x

It is 10 times faster than the original numpy-approach:


However, the task is embarrassingly parallelizable, thus we also could use prange in order to calculate the loop in parallel:
@nb.njit(parallel=True)
def nb_par_jitf(x):
    y=np.empty(x.shape)
    for i in nb.prange(len(x)):
        y[i]=x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y

As expected, the parallel function is slower for smaller inputs, but faster (almost factor 2) for larger sizes:


While numba specializes on optimizing operations with numpy-arrays, Cython is a more general tool. It is more complicated to extract the same performance as with numba - often it is down to llvm (numba) vs local compiler (gcc/MSVC):
%%cython -c=/openmp -a
import numpy as np
import cython

#single core:
@cython.boundscheck(False) 
@cython.wraparound(False) 
def cy_f(double[::1] x):
    y_out=np.empty(len(x))
    cdef Py_ssize_t i
    cdef double[::1] y=y_out
    for i in range(len(x)):
        y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y_out

#parallel:
from cython.parallel import prange
@cython.boundscheck(False) 
@cython.wraparound(False)  
def cy_par_f(double[::1] x):
    y_out=np.empty(len(x))
    cdef double[::1] y=y_out
    cdef Py_ssize_t i
    cdef Py_ssize_t n = len(x)
    for i in prange(n, nogil=True):
        y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y_out

Cython results in somewhat slower functions:


Conclusion
Obviously, testing only for one function doesn't prove anything. Also one should keep in mind, that for the choosen function-example, the bandwidth of the memory was the bottle neck for sizes larger than 10^5 elements - thus we had the same performance for numba, numexpr and cython in this region.
In the end, the ultimative answer depends on the type of function, hardware,  Python-distribution and other factors. For example Anaconda-distribution uses Intel's VML for numpy's functions and thus outperforms numba (unless it uses SVML, see this SO-post) easily for transcendental functions like exp, sin, cos and similar - see e.g. the following SO-post.
Yet from this investigation and from my experience so far, I would state, that numba seems to be the easiest tool with best performance as long as no transcendental functions are involved.

Plotting running times with perfplot-package:
import perfplot
perfplot.show(
    setup=lambda n: np.random.rand(n),
    n_range=[2**k for k in range(0,24)],
    kernels=[
        f, 
        vf,
        ne_f, 
        nb_vf, nb_par_jitf,
        cy_f, cy_par_f,
        ],
    logx=True,
    logy=True,
    xlabel='len(x)'
    )

    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它11个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复