Cython function taking more time than pure python

灰色年华  2020-12-20 00:06

I am trying to accelerate my code, and this part of it is giving me problems.

I tried to use Cython and then followed the advice given here, but my pure Python function still runs faster than the Cython version.

2 Answers
  • 2020-12-20 00:21

    I generally agree with the advice presented by @chepner and @juanpa.arrivillaga in the comments. Numpy is a performant library, and it is true that the underlying calculations it performs are written in C. Furthermore, the syntax is clean and it is trivial to apply scalar operations across all elements of a numpy array.

    However, there actually is a way to significantly improve the performance of your code with cython thanks to the way your particular algorithm is structured if we use the following assumptions (and can tolerate ugly code):

    • Your arrays are all one-dimensional, so iterating over each item is trivial. We do not need to reimplement more complex numpy functions such as numpy.dot, since every operation in your code only combines scalars with arrays.
    • While using a for loop in python would be unthinkable, iterating over every index is very feasible in cython. Additionally, each item in the final output depends only on the inputs that correspond to that item's index (i.e. the 0th item uses u[0], PorosityProfile[0], etc).
    • You are not interested in any of the intermediate arrays, only in the final result returned in your compute_python function. Therefore, why waste time allocating memory for all of those intermediate numpy arrays?
    • Using x**y syntax is surprisingly slow. I use a gcc compiler option, -ffast-math, to improve this significantly. I also use several cython compiler directives to avoid python checks and overhead.
    • Creating numpy arrays itself can incur python overhead, so I use a combination of typed memoryviews (the preferred, newer syntax anyway) and malloc-ed pointers to create the output array without interacting with python very much (only two lines, getting the output size and the return statement, show significant python interaction, as seen in the cython annotation files).

    Taking all of these considerations into account, here is the modified code. It performs nearly an order of magnitude faster than the naive python version on my laptop.

    sublimation.pyx

    from libc.stdlib cimport malloc, free
    
    def compute_cython(float[:] u, float[:] porosity_profile, 
            float[:] density_ice_profile, float[:] density_dust_profile, 
            float[:] density_profile):    
        cdef:
            float dust_j, dust_f, dust_g, dust_h, dust_i
            float ice_i, ice_c, ice_d, ice_e, ice_f, ice_g, ice_h
            int size, i
            float dt, result_dust, x, dust
            float result_ice_numer, result_ice_denom, result_ice, ice
            float* out
    
        dust_j, dust_f, dust_g, dust_h, dust_i = \
            250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
        ice_i, ice_c, ice_d, ice_e, ice_f, ice_g, ice_h = \
            273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6
        size = len(u)
        out = <float *>malloc(size * sizeof(float))
    
        for i in range(size):
            dt = u[i] - dust_j
            result_dust = dust_f + (dust_g*dt) + (dust_h*dt**2) + (dust_i*dt**3)
            x = u[i] / ice_i
            result_ice_numer = x**3*(ice_c + ice_d*x**2 + ice_e*x**6)
            result_ice_denom = 1 + ice_f*x**2 + ice_g*x**4 + ice_h*x**8
            result_ice = result_ice_numer / result_ice_denom
            ice = density_ice_profile[i]*result_ice
            dust = density_dust_profile[i]*result_dust
            out[i] = (dust + ice)/density_profile[i]
        return <float[:size]>out
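
    One caveat with this approach: the buffer allocated with malloc is never freed once the returned view is no longer referenced, so repeated calls leak memory. A minimal sketch of one way to plug the leak, using the callback_free_data hook of cython.view.array described in the Cython memoryview docs (the pointer cast produces such an array object), would replace the last line with roughly:

    # from cython cimport view   (at module level)
    cdef view.array result = <float[:size]>out
    result.callback_free_data = free   # call free() on the buffer when the array is finalized
    return result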
    

    setup.py

    from distutils.core import setup
    from Cython.Build import cythonize
    from distutils.core import Extension
    
    def create_extension(ext_name):
        global language, libs, args, link_args
        path_parts = ext_name.split(".")
        path = "./{0}.pyx".format("/".join(path_parts))
        ext = Extension(ext_name, sources=[path], libraries=libs, language=language,
                extra_compile_args=args, extra_link_args=link_args)
        return ext
    
    if __name__ == "__main__":
        libs = []#no external c libraries in this case
        language = "c"#chooses c rather than c++ since no c++ features were used
        args = ["-w", "-O3", "-ffast-math"]#assumes gcc is the compiler
        link_args = []#none here, could use -fopenmp for parallel code
        annotate = True#autogenerates .html files per .pyx
        directives = {#saves typing @cython decorators and applies them globally
            "boundscheck": False,
            "wraparound": False,
            "initializedcheck": False,
            "cdivision": True,
            "nonecheck": False,
        }
    
        ext_names = [
            "sublimation",
        ]
    
        extensions = [create_extension(ext_name) for ext_name in ext_names]
        setup(ext_modules = cythonize(
                extensions, 
                annotate=annotate, 
                compiler_directives=directives,
            )
        )
    

    main.py

    import numpy as np
    import sublimation as sub
    
    def compute_python(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
        DustJ, DustF, DustG, DustH, DustI = 250.0, 633.0, 2.513, -2.2e-3, -2.8e-6   
        IceI, IceC, IceD, IceE, IceF, IceG, IceH =  273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2,  6.4650e4, 1.6935e6
        delta = u-DustJ
        result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)
        x = u/IceI
        result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))
        return (DensityIceProfile*result_ice+DensityDustProfile*result_dust)/DensityProfile
    
    size = 100
    u = np.random.rand(size).astype(np.float32)
    porosity = np.random.rand(size).astype(np.float32)
    ice = np.random.rand(size).astype(np.float32)
    dust = np.random.rand(size).astype(np.float32)
    density = np.random.rand(size).astype(np.float32)
    
    """
    Run these from the terminal to test the performance!
    python3 -m timeit -s "from main import compute_python, u, porosity, ice, dust, density" "compute_python(u, porosity, ice, dust, density)"
    python3 -m timeit -s "from main import sub, u, porosity, ice, dust, density" "sub.compute_cython(u, porosity, ice, dust, density)"
    python3 -m timeit -s "import numpy as np; from main import sub, u, porosity, ice, dust, density" "np.asarray(sub.compute_cython(u, porosity, ice, dust, density))"
    
    The first command tests the python version. (10000 loops, best of 3: 45.5 usec per loop)
    The second command tests the cython version, but returns just a memoryview object. (100000 loops, best of 3: 4.63 usec per loop)
    The third command tests the cython version, but converts the result to a ndarray (slower). (100000 loops, best of 3: 6.3 usec per loop)
    """
    

    Let me know if there are any unclear parts in my explanation for how this answer works, and I hope it helps!


    Update 1:

    Unfortunately, I was unable to get MSYS2 and numba (which depends on LLVM) to play nicely with each other, so I could not do any direct comparisons. However, following @max9111's advice, I added -march=native to the args list in my setup.py file (shown below); the timings did not differ significantly from before.
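
    For reference, that change amounts to appending one flag to the args list in setup.py (gcc syntax; other compilers spell this flag differently):

    args = ["-w", "-O3", "-ffast-math", "-march=native"]  # tune code generation for the build machine's CPU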

    From this great answer, it appears that there is some overhead in the automatic conversion between numpy arrays and typed memoryviews, both in the initial function call and in the return statement (if you convert the result back to an ndarray). Reverting to a function signature like this:

    cimport numpy as np  # needed for np.ndarray[...] buffer typing; numpy's headers must be on the include path
    ctypedef np.float32_t DTYPE_t
    def compute_cython_np(
            np.ndarray[DTYPE_t, ndim=1] u, 
            np.ndarray[DTYPE_t, ndim=1] porosity_profile, 
            np.ndarray[DTYPE_t, ndim=1] density_ice_profile, 
            np.ndarray[DTYPE_t, ndim=1] density_dust_profile, 
            np.ndarray[DTYPE_t, ndim=1] density_profile):
    

    saves me about 1us per call, cutting it down to about 3.6us instead of 4.6us, which is somewhat significant, especially if the function is to be called many times. Of course, if you plan to call the function many times, it might be more efficient to pass in two-dimensional numpy arrays instead, saving significant python function call overhead and amortizing the cost of the numpy array -> typed memoryview conversion. Furthermore, it might be interesting to use numpy structured arrays, which cython can transform into a typed memoryview of structs, as this might put all of the data closer together in the cache and speed up memory access (a rough sketch of the batched idea follows below).
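
    For illustration, here is a rough, hypothetical sketch of that batched idea (the row-per-profile layout and the name compute_cython_batch are my own, not from the question): each row of the C-contiguous 2D inputs is one profile, so a single call amortizes both the Python call overhead and the array-to-memoryview conversion.

    import numpy as np

    def compute_cython_batch(float[:, ::1] u, float[:, ::1] porosity_profile,
            float[:, ::1] density_ice_profile, float[:, ::1] density_dust_profile,
            float[:, ::1] density_profile):
        cdef:
            int rows, cols, r, c
            float dt, x, result_dust, result_ice
            float[:, ::1] out

        rows, cols = u.shape[0], u.shape[1]
        out = np.empty((rows, cols), dtype=np.float32)
        for r in range(rows):
            for c in range(cols):
                # same per-element arithmetic as compute_cython, constants inlined
                dt = u[r, c] - 250.0
                result_dust = 633.0 + 2.513*dt - 2.2e-3*dt**2 - 2.8e-6*dt**3
                x = u[r, c] / 273.16
                result_ice = (x**3*(1.843e5 + 1.6357e8*x**2 + 3.5519e9*x**6)
                        / (1 + 1.6670e2*x**2 + 6.4650e4*x**4 + 1.6935e6*x**8))
                out[r, c] = (density_ice_profile[r, c]*result_ice
                        + density_dust_profile[r, c]*result_dust) / density_profile[r, c]
        return np.asarray(out)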

    As a final note, as promised in the comments earlier, here is a version using prange that takes advantage of parallel processing. Note that this can only be used with typed memoryviews, since python's GIL must be released within a prange loop, and it must be compiled with the -fopenmp flag added to both args and link_args:

    from cython.parallel import prange
    from libc.stdlib cimport malloc, free
    def compute_cython_p(float[:] u, float[:] porosity_profile, 
            float[:] density_ice_profile, float[:] density_dust_profile, 
            float[:] density_profile):    
        cdef:
            float dust_j, dust_f, dust_g, dust_h, dust_i
            float ice_i, ice_c, ice_d, ice_e, ice_f, ice_g, ice_h
            int size, i
            float dt, result_dust, x, dust
            float result_ice_numer, result_ice_denom, result_ice, ice
            float* out
    
        dust_j, dust_f, dust_g, dust_h, dust_i = \
            250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
        ice_i, ice_c, ice_d, ice_e, ice_f, ice_g, ice_h = \
            273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6
        size = len(u)
        out = <float *>malloc(size * sizeof(float))
    
        for i in prange(size, nogil=True):
            dt = u[i] - dust_j
            result_dust = dust_f + (dust_g*dt) + (dust_h*dt**2) + (dust_i*dt**3)
            x = u[i] / ice_i
            result_ice_numer = x**3*(ice_c + ice_d*x**2 + ice_e*x**6)
            result_ice_denom = 1 + ice_f*x**2 + ice_g*x**4 + ice_h*x**8
            result_ice = result_ice_numer / result_ice_denom
            ice = density_ice_profile[i]*result_ice
            dust = density_dust_profile[i]*result_dust
            out[i] = (dust + ice)/density_profile[i]
        return <float[:size]>out
    

    Update 2:

    Following the very useful additional advice from @max9111 in the comments, I switched all of the float[:] declarations in my code to float[::1]. The significance of this is that it declares the data as C-contiguous, so cython no longer has to account for a stride between elements. That enables SIMD vectorization, which further optimizes the code dramatically (the one-line signature change is shown after the timings). Below are the updated timings, generated using the following commands:

    python3 -m timeit -s "from main import compute_python, u, porosity, ice, dust, density" "compute_python(u, porosity, ice, dust, density)"
    python3 -m timeit -s "import numpy as np; from main import sub, u, porosity, ice, dust, density" "np.asarray(sub.compute_cython(u, porosity, ice, dust, density))"
    python3 -m timeit -s "import numpy as np; from main import sub, u, porosity, ice, dust, density" "np.asarray(sub.compute_cython_p(u, porosity, ice, dust, density))"
    
    size = 100
    python: 44.7 usec per loop
    cython serial: 4.44 usec per loop
    cython parallel: 111 usec per loop
    cython serial contiguous: 3.83 usec per loop
    cython parallel contiguous: 116 usec per loop
    
    size = 1000
    python: 167 usec per loop
    cython serial: 16.4 usec per loop
    cython parallel: 115 usec per loop
    cython serial contiguous: 8.24 usec per loop
    cython parallel contiguous: 111 usec per loop
    
    size = 10000
    python: 1.32 msec per loop
    cython serial: 128 usec per loop
    cython parallel: 142 usec per loop
    cython serial contiguous: 55.5 usec per loop
    cython parallel contiguous: 150 usec per loop
    
    size = 100000
    python: 19.5 msec per loop
    cython serial: 1.21 msec per loop
    cython parallel: 691 usec per loop
    cython serial contiguous: 473 usec per loop
    cython parallel contiguous: 274 usec per loop
    
    size = 1000000
    python: 211 msec per loop
    cython serial: 12.3 msec per loop
    cython parallel: 5.74 msec per loop
    cython serial contiguous: 4.82 msec per loop
    cython parallel contiguous: 1.99 msec per loop
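
    For reference, the only source change behind the "contiguous" rows above is in the function signatures; ::1 declares the single dimension as contiguous, and the body stays the same:

    def compute_cython(float[::1] u, float[::1] porosity_profile,
            float[::1] density_ice_profile, float[::1] density_dust_profile,
            float[::1] density_profile):
        # body unchanged from the version above
        ...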
    
  • 2020-12-20 00:42

    A solution using Numba

    CodeSurgeon already gave an excellent answer using Cython. In this answer I want to show an alternative way using Numba.

    I have created three versions. In naive_Numba I only added a function decorator. In improved_Numba I manually merged the loops (every vectorized command is really a loop). In improved_Numba_p I parallelized the function. Please note that there is apparently a bug that prevents declaring constant values inside the function when the parallel accelerator is used. It should also be noted that the parallelized version is only beneficial for larger input arrays, but you can add a small wrapper that calls the single-threaded or the parallelized version depending on the input array size (a minimal sketch of such a wrapper follows below).
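
    A minimal sketch of such a wrapper (hypothetical; the cutoff is illustrative and should be tuned by benchmarking on the target machine, and it assumes the float32 variants further below, where the serial and parallel functions take the same five arrays):

    # Dispatch to the serial version for small inputs, where threading overhead
    # dominates, and to the parallel version for large ones.
    def compute(u, porosity, density_ice, density_dust, density, cutoff=100_000):
        if u.shape[0] < cutoff:
            return improved_Numba(u, porosity, density_ice, density_dust, density)
        return improved_Numba_p(u, porosity, density_ice, density_dust, density)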

    Code dtype=float64

    import numba as nb
    import numpy as np
    import time
    
    
    
    @nb.njit(fastmath=True)
    def naive_Numba(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
      DustJ, DustF, DustG, DustH, DustI = 250.0, 633.0, 2.513, -2.2e-3, -2.8e-6   
      IceI, IceC, IceD, IceE, IceF, IceG, IceH =  273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2,  6.4650e4, 1.6935e6
    
      delta = u-DustJ
      result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3);
    
      x= u/IceI;
      result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))
    
      return (DensityIceProfile*result_ice+DensityDustProfile*result_dust)/DensityProfile
    
    #error_model='numpy' sets division by 0 to NaN/inf instead of raising an exception, which allows vectorization
    @nb.njit(fastmath=True,error_model='numpy')
    def improved_Numba(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
      DustJ, DustF, DustG, DustH, DustI = 250.0, 633.0, 2.513, -2.2e-3, -2.8e-6   
      IceI, IceC, IceD, IceE, IceF, IceG, IceH =  273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2,  6.4650e4, 1.6935e6
      res=np.empty(u.shape[0],dtype=u.dtype)
    
      for i in range(u.shape[0]):
        delta = u[i]-DustJ
        result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3);
    
        x= u[i]/IceI
        result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))
    
        res[i]=(DensityIceProfile[i]*result_ice+DensityDustProfile[i]*result_dust)/DensityProfile[i]
    
      return res
    
    #there is obviously a bug in Numba (declaring const values in the function)
    @nb.njit(fastmath=True,parallel=True,error_model='numpy')
    def improved_Numba_p(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile,DustJ, DustF, DustG, DustH, DustI,IceI, IceC, IceD, IceE, IceF, IceG, IceH):
      res=np.empty((u.shape[0]),dtype=u.dtype)
    
      for i in nb.prange(u.shape[0]):
        delta = u[i]-DustJ
        result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3);
    
        x= u[i]/IceI
        result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))
    
        res[i]=(DensityIceProfile[i]*result_ice+DensityDustProfile[i]*result_dust)/DensityProfile[i]
    
      return res
    
    u=np.array(np.random.rand(1000000),dtype=np.float32)
    PorosityProfile=np.array(np.random.rand(1000000),dtype=np.float32)
    DensityIceProfile=np.array(np.random.rand(1000000),dtype=np.float32)
    DensityDustProfile=np.array(np.random.rand(1000000),dtype=np.float32)
    DensityProfile=np.array(np.random.rand(1000000),dtype=np.float32)
    DustJ, DustF, DustG, DustH, DustI = 250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
    IceI, IceC, IceD, IceE, IceF, IceG, IceH =  273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2,  6.4650e4, 1.6935e6
    
    #don't measure compilation overhead on first call
    res=improved_Numba_p(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile,DustJ, DustF, DustG, DustH, DustI,IceI, IceC, IceD, IceE, IceF, IceG, IceH)
    t1=time.time()  #start timing after the warm-up call
    for i in range(1000):
      res=improved_Numba_p(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile,DustJ, DustF, DustG, DustH, DustI,IceI, IceC, IceD, IceE, IceF, IceG, IceH)

    print(time.time()-t1)
    

    Performance

    Arraysize np.random.rand(100)
    Numpy             46.8µs
    naive Numba       3.1µs
    improved Numba:   1.62µs
    improved_Numba_p: 17.45µs
    
    
    #Arraysize np.random.rand(1000000)
    Numpy             255.8ms
    naive Numba       18.6ms
    improved Numba:   6.13ms
    improved_Numba_p: 3.54ms
    

    Code dtype=np.float32

    If np.float32 precision is sufficient, you have to declare all constant values in the function explicitly as float32; otherwise Numba will compute in float64.

    @nb.njit(fastmath=True,error_model='numpy')
    def improved_Numba(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
      DustJ, DustF, DustG, DustH, DustI = nb.float32(250.0), nb.float32(633.0), nb.float32(2.513), nb.float32(-2.2e-3), nb.float32(-2.8e-6)
      IceI, IceC, IceD, IceE, IceF, IceG, IceH =  nb.float32(273.16), nb.float32(1.843e5), nb.float32(1.6357e8), nb.float32(3.5519e9), nb.float32(1.6670e2),  nb.float32(6.4650e4), nb.float32(1.6935e6)
      res=np.empty(u.shape[0],dtype=u.dtype)
    
      for i in range(u.shape[0]):
        delta = u[i]-DustJ
        result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)
    
        x= u[i]/IceI
        result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(nb.float32(1)+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))
    
        res[i]=(DensityIceProfile[i]*result_ice+DensityDustProfile[i]*result_dust)/DensityProfile[i]
    
      return res
    
    @nb.njit(fastmath=True,parallel=True,error_model='numpy')
    def improved_Numba_p(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
      res=np.empty((u.shape[0]),dtype=u.dtype)
      DustJ, DustF, DustG, DustH, DustI = nb.float32(250.0), nb.float32(633.0), nb.float32(2.513), nb.float32(-2.2e-3), nb.float32(-2.8e-6)
      IceI, IceC, IceD, IceE, IceF, IceG, IceH =  nb.float32(273.16), nb.float32(1.843e5), nb.float32(1.6357e8), nb.float32(3.5519e9), nb.float32(1.6670e2),  nb.float32(6.4650e4), nb.float32(1.6935e6)
    
      for i in nb.prange(u.shape[0]):
        delta = u[i]-DustJ
        result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)
    
        x= u[i]/IceI
        result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(nb.float32(1)+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))
    
        res[i]=(DensityIceProfile[i]*result_ice+DensityDustProfile[i]*result_dust)/DensityProfile[i]
    
      return res
    

    Performance

    Arraysize np.random.rand(100).astype(np.float32)
    Numpy             29.3µs
    improved Numba:   1.33µs
    improved_Numba_p: 18µs
    
    
    Arraysize np.random.rand(1000000).astype(np.float32)
    Numpy             117ms
    improved Numba:   2.46ms
    improved_Numba_p: 1.56ms
    

    The comparison to the Cython version provided by @CodeSurgeon isn't really fair, because he didn't compile the function with AVX2 and FMA3 instructions enabled. Numba compiles with -march=native by default, which enables AVX2 and FMA3 on my Core i7-4xxx.

    But this makes sense if you want to distribute a compiled Cython version of your code, because with those optimizations enabled it will not run on pre-Haswell processors (or on any Pentium and Celeron chips). Compiling multiple code paths should be possible, but that is compiler dependent and more work (a sketch of the flag change is shown below).
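
    A sketch of that flag change in the setup.py from the Cython answer (my assumption of roughly equivalent gcc flags, not something benchmarked above): adding AVX2/FMA code generation, or simply -march=native, at the cost of portability to older CPUs.

    # gcc flags roughly matching Numba's defaults on an AVX2/FMA3-capable CPU;
    # binaries built this way will not run on CPUs lacking these instructions.
    args = ["-w", "-O3", "-ffast-math", "-mavx2", "-mfma"]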
