I am trying to accelerate my code, and this part of it is giving me problems. I tried to use Cython and then followed the advice given here, but my pure python function still performs better than the cython versions.
I generally agree with the advice presented by @chepner and @juanpa.arrivillaga in the comments. Numpy is a performant library, and it is true that the underlying calculations it performs are written in C. Furthermore, the syntax is clean and it is trivial to apply scalar operations across all elements of a numpy array.
However, there actually is a way to significantly improve the performance of your code with cython, thanks to the way your particular algorithm is structured, if we use the following assumptions (and can tolerate ugly code):

- Your arrays are all one-dimensional, so iterating over every item is trivial. There is no need to replace harder-to-reproduce numpy functions such as numpy.dot, for example, as all operations in your code only combine scalars with matrices.
- While using a for loop in python would be unthinkable, iterating over every index is very feasible in cython. Additionally, each item in the final output depends only on the inputs that correspond to that item's index (i.e. the 0th item uses u[0], PorosityProfile[0], etc.).
- You are not interested in any of the intermediate arrays, only in the final result returned by your compute_python function. Therefore, why waste time allocating memory for all of those intermediate numpy arrays?
- The x**y syntax is surprisingly slow. I use the gcc compiler option -ffast-math to improve this significantly. I also use several cython compiler directives to avoid python checks and overhead.

Taking all of these considerations into account, here is the modified code. It performs nearly an order of magnitude faster than the naive python version on my laptop.
sublimation.pyx
from libc.stdlib cimport malloc, free

def compute_cython(float[:] u, float[:] porosity_profile,
        float[:] density_ice_profile, float[:] density_dust_profile,
        float[:] density_profile):
    cdef:
        float dust_j, dust_f, dust_g, dust_h, dust_i
        float ice_i, ice_c, ice_d, ice_e, ice_f, ice_g, ice_h
        int size, i
        float dt, result_dust, x, dust
        float result_ice_numer, result_ice_denom, result_ice, ice
        float* out

    # constants used in the dust and ice polynomial fits
    dust_j, dust_f, dust_g, dust_h, dust_i = \
        250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
    ice_i, ice_c, ice_d, ice_e, ice_f, ice_g, ice_h = \
        273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6

    size = len(u)
    # note: this malloc'd buffer is never freed here; the returned view simply wraps it
    out = <float *>malloc(size * sizeof(float))

    for i in range(size):
        dt = u[i] - dust_j
        result_dust = dust_f + (dust_g*dt) + (dust_h*dt**2) + (dust_i*dt**3)
        x = u[i] / ice_i
        result_ice_numer = x**3*(ice_c + ice_d*x**2 + ice_e*x**6)
        result_ice_denom = 1 + ice_f*x**2 + ice_g*x**4 + ice_h*x**8
        result_ice = result_ice_numer / result_ice_denom
        ice = density_ice_profile[i]*result_ice
        dust = density_dust_profile[i]*result_dust
        out[i] = (dust + ice)/density_profile[i]
    return <float[:size]>out
setup.py
from distutils.core import setup
from Cython.Build import cythonize
from distutils.core import Extension

def create_extension(ext_name):
    global language, libs, args, link_args
    path_parts = ext_name.split(".")
    path = "./{0}.pyx".format("/".join(path_parts))
    ext = Extension(ext_name, sources=[path], libraries=libs, language=language,
                    extra_compile_args=args, extra_link_args=link_args)
    return ext

if __name__ == "__main__":
    libs = []  # no external c libraries in this case
    language = "c"  # chooses c rather than c++ since no c++ features were used
    args = ["-w", "-O3", "-ffast-math"]  # assumes gcc is the compiler
    link_args = []  # none here, could use -fopenmp for parallel code
    annotate = True  # autogenerates .html files per .pyx
    directives = {  # saves typing @cython decorators and applies them globally
        "boundscheck": False,
        "wraparound": False,
        "initializedcheck": False,
        "cdivision": True,
        "nonecheck": False,
    }

    ext_names = [
        "sublimation",
    ]

    extensions = [create_extension(ext_name) for ext_name in ext_names]
    setup(ext_modules=cythonize(
        extensions,
        annotate=annotate,
        compiler_directives=directives,
        )
    )
main.py
import numpy as np
import sublimation as sub

def compute_python(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
    DustJ, DustF, DustG, DustH, DustI = 250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
    IceI, IceC, IceD, IceE, IceF, IceG, IceH = 273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6
    delta = u-DustJ
    result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)
    x = u/IceI
    result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))
    return (DensityIceProfile*result_ice+DensityDustProfile*result_dust)/DensityProfile

size = 100
u = np.random.rand(size).astype(np.float32)
porosity = np.random.rand(size).astype(np.float32)
ice = np.random.rand(size).astype(np.float32)
dust = np.random.rand(size).astype(np.float32)
density = np.random.rand(size).astype(np.float32)

"""
Run these from the terminal to test the performance!

python3 -m timeit -s "from main import compute_python, u, porosity, ice, dust, density" "compute_python(u, porosity, ice, dust, density)"
python3 -m timeit -s "from main import sub, u, porosity, ice, dust, density" "sub.compute_cython(u, porosity, ice, dust, density)"
python3 -m timeit -s "import numpy as np; from main import sub, u, porosity, ice, dust, density" "np.asarray(sub.compute_cython(u, porosity, ice, dust, density))"

The first command tests the python version. (10000 loops, best of 3: 45.5 usec per loop)
The second command tests the cython version, but returns just a memoryview object. (100000 loops, best of 3: 4.63 usec per loop)
The third command tests the cython version, but converts the result to an ndarray (slower). (100000 loops, best of 3: 6.3 usec per loop)
"""
Let me know if any part of the explanation of how this works is unclear, and I hope it helps!
Update 1:
Unfortunately, I was unable to get MSYS2 and numba (which depends on LLVM) to play nicely with each other, so I could not do any direct comparisons. However, following @max9111's advice, I added -march=native to the args list in my setup.py file; the timings, however, did not differ significantly from before.
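For reference, this was the only change needed in setup.py (a minimal sketch; -march=native assumes gcc or clang and tunes the build for the compiling machine's CPU):

args = ["-w", "-O3", "-ffast-math", "-march=native"]  # assumes gcc is the compiler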
From this great answer, it appears that there is some overhead in the automatic conversion between numpy arrays and typed memoryviews, both in the initial function call and in the return statement if you convert the result back. Reverting to a function signature like this:
cimport numpy as np

ctypedef np.float32_t DTYPE_t

def compute_cython_np(
        np.ndarray[DTYPE_t, ndim=1] u,
        np.ndarray[DTYPE_t, ndim=1] porosity_profile,
        np.ndarray[DTYPE_t, ndim=1] density_ice_profile,
        np.ndarray[DTYPE_t, ndim=1] density_dust_profile,
        np.ndarray[DTYPE_t, ndim=1] density_profile):
saves me about 1us per call, cutting it down to about 3.6us instead of 4.6us, which is somewhat significant, especially if the function is to be called many times. Of course, if you plan to call the function many times, it might be more efficient to pass in two-dimensional numpy arrays instead, saving significant python function call overhead and amortizing the cost of the numpy array -> typed memoryview conversion. Furthermore, it might be interesting to use numpy structured arrays, which can be transformed in cython into a typed memoryview of structs, as this might put all of the data closer together in the cache and speed up memory access times; a rough sketch of that packing is shown below.
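Here is a rough sketch of the structured-array idea, reusing size, porosity, ice, dust, and density from main.py (the field names, the packing code, and the compute_cython_struct function mentioned in the comment are hypothetical, not part of the code above):

import numpy as np

# Pack the per-index inputs into one record per index so that all values
# used by iteration i sit next to each other in memory.
profile_dtype = np.dtype([
    ("porosity", np.float32),
    ("density_ice", np.float32),
    ("density_dust", np.float32),
    ("density", np.float32),
])
profiles = np.zeros(size, dtype=profile_dtype)
profiles["porosity"] = porosity
profiles["density_ice"] = ice
profiles["density_dust"] = dust
profiles["density"] = density

# On the cython side this could then be received as a typed memoryview of a
# matching packed struct, e.g. a hypothetical compute_cython_struct(u, profiles).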
As a final note, as promised in the comments earlier, here is a version using prange that takes advantage of parallel processing. Note that this can only be used with typed memoryviews, as python's GIL must be released within a prange loop, and the module must be compiled with the -fopenmp flag in both args and link_args (the setup.py change is sketched after the code):
from cython.parallel import prange
from libc.stdlib cimport malloc, free

def compute_cython_p(float[:] u, float[:] porosity_profile,
        float[:] density_ice_profile, float[:] density_dust_profile,
        float[:] density_profile):
    cdef:
        float dust_j, dust_f, dust_g, dust_h, dust_i
        float ice_i, ice_c, ice_d, ice_e, ice_f, ice_g, ice_h
        int size, i
        float dt, result_dust, x, dust
        float result_ice_numer, result_ice_denom, result_ice, ice
        float* out

    dust_j, dust_f, dust_g, dust_h, dust_i = \
        250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
    ice_i, ice_c, ice_d, ice_e, ice_f, ice_g, ice_h = \
        273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6

    size = len(u)
    out = <float *>malloc(size * sizeof(float))

    # scalars assigned inside the prange body are treated as thread-local by cython
    for i in prange(size, nogil=True):
        dt = u[i] - dust_j
        result_dust = dust_f + (dust_g*dt) + (dust_h*dt**2) + (dust_i*dt**3)
        x = u[i] / ice_i
        result_ice_numer = x**3*(ice_c + ice_d*x**2 + ice_e*x**6)
        result_ice_denom = 1 + ice_f*x**2 + ice_g*x**4 + ice_h*x**8
        result_ice = result_ice_numer / result_ice_denom
        ice = density_ice_profile[i]*result_ice
        dust = density_dust_profile[i]*result_dust
        out[i] = (dust + ice)/density_profile[i]
    return <float[:size]>out
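For completeness, the corresponding setup.py change for this parallel version would be along these lines (a sketch assuming gcc, with -march=native carried over from Update 1):

args = ["-w", "-O3", "-ffast-math", "-march=native", "-fopenmp"]
link_args = ["-fopenmp"]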
Update 2:
Following the very useful additional advice from @max9111 in the comments, I switched all of the float[:] declarations in my code to float[::1]. This declares the memoryviews as contiguous, so cython no longer needs to account for a possible stride between elements, which in turn allows SIMD vectorization and dramatically further optimizes the code. Below are the updated timings, generated using the following commands:
python3 -m timeit -s "from main import compute_python, u, porosity, ice, dust, density" "compute_python(u, porosity, ice, dust, density)"
python3 -m timeit -s "import numpy as np; from main import sub, u, porosity, ice, dust, density" "np.asarray(sub.compute_cython(u, porosity, ice, dust, density))"
python3 -m timeit -s "import numpy as np; from main import sub, u, porosity, ice, dust, density" "np.asarray(sub.compute_cython_p(u, porosity, ice, dust, density))"
size = 100
python: 44.7 usec per loop
cython serial: 4.44 usec per loop
cython parallel: 111 usec per loop
cython serial contiguous: 3.83 usec per loop
cython parallel contiguous: 116 usec per loop
size = 1000
python: 167 usec per loop
cython serial: 16.4 usec per loop
cython parallel: 115 usec per loop
cython serial contiguous: 8.24 usec per loop
cython parallel contiguous: 111 usec per loop
size = 10000
python: 1.32 msec per loop
cython serial: 128 usec per loop
cython parallel: 142 usec per loop
cython serial contiguous: 55.5 usec per loop
cython parallel contiguous: 150 usec per loop
size = 100000
python: 19.5 msec per loop
cython serial: 1.21 msec per loop
cython parallel: 691 usec per loop
cython serial contiguous: 473 usec per loop
cython parallel contiguous: 274 usec per loop
size = 1000000
python: 211 msec per loop
cython serial: 12.3 msec per loop
cython parallel: 5.74 msec per loop
cython serial contiguous: 4.82 msec per loop
cython parallel contiguous: 1.99 msec per loop
CodeSurgeon already gave an excellent answer using Cython. In this answer I want to show an alternative way using Numba.
I have created three versions. In naive_Numba I have only added a function decorator. In improved_Numba I have manually combined the loops (every vectorized numpy command is really a loop under the hood). In improved_Numba_p I have parallelized the function. Please note that there appears to be a bug in Numba that does not allow constant values to be defined inside the function when using the parallel accelerator, so in that version the constants are passed in as arguments. It should also be noted that the parallelized version is only beneficial for larger input arrays, but you can add a small wrapper which calls the single-threaded or the parallelized version according to the input array size, as sketched below.
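A minimal sketch of such a wrapper, assuming the float64 functions and module-level constants defined below (the 10,000-element cutoff is only a placeholder and should be tuned by benchmarking on your machine):

def compute(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
    # small arrays: thread start-up overhead dominates, so use the serial version
    if u.shape[0] < 10_000:
        return improved_Numba(u, PorosityProfile, DensityIceProfile,
                              DensityDustProfile, DensityProfile)
    # large arrays: the parallel version pays off
    return improved_Numba_p(u, PorosityProfile, DensityIceProfile,
                            DensityDustProfile, DensityProfile,
                            DustJ, DustF, DustG, DustH, DustI,
                            IceI, IceC, IceD, IceE, IceF, IceG, IceH)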
Code dtype=float64
import numba as nb
import numpy as np
import time

@nb.njit(fastmath=True)
def naive_Numba(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
    DustJ, DustF, DustG, DustH, DustI = 250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
    IceI, IceC, IceD, IceE, IceF, IceG, IceH = 273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6

    delta = u-DustJ
    result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)
    x = u/IceI
    result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))
    return (DensityIceProfile*result_ice+DensityDustProfile*result_dust)/DensityProfile

# error_model='numpy' sets division by 0 to NaN instead of throwing an exception; this allows vectorization
@nb.njit(fastmath=True, error_model='numpy')
def improved_Numba(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
    DustJ, DustF, DustG, DustH, DustI = 250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
    IceI, IceC, IceD, IceE, IceF, IceG, IceH = 273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6
    res = np.empty(u.shape[0], dtype=u.dtype)

    for i in range(u.shape[0]):
        delta = u[i]-DustJ
        result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)
        x = u[i]/IceI
        result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))
        res[i] = (DensityIceProfile[i]*result_ice+DensityDustProfile[i]*result_dust)/DensityProfile[i]
    return res

# there is obviously a bug in Numba (declaring const values in the function), so the constants are passed in
@nb.njit(fastmath=True, parallel=True, error_model='numpy')
def improved_Numba_p(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile, DustJ, DustF, DustG, DustH, DustI, IceI, IceC, IceD, IceE, IceF, IceG, IceH):
    res = np.empty((u.shape[0]), dtype=u.dtype)

    for i in nb.prange(u.shape[0]):
        delta = u[i]-DustJ
        result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)
        x = u[i]/IceI
        result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(1+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))
        res[i] = (DensityIceProfile[i]*result_ice+DensityDustProfile[i]*result_dust)/DensityProfile[i]
    return res

u = np.array(np.random.rand(1000000), dtype=np.float32)
PorosityProfile = np.array(np.random.rand(1000000), dtype=np.float32)
DensityIceProfile = np.array(np.random.rand(1000000), dtype=np.float32)
DensityDustProfile = np.array(np.random.rand(1000000), dtype=np.float32)
DensityProfile = np.array(np.random.rand(1000000), dtype=np.float32)
DustJ, DustF, DustG, DustH, DustI = 250.0, 633.0, 2.513, -2.2e-3, -2.8e-6
IceI, IceC, IceD, IceE, IceF, IceG, IceH = 273.16, 1.843e5, 1.6357e8, 3.5519e9, 1.6670e2, 6.4650e4, 1.6935e6

# don't measure compilation overhead on the first call
res = improved_Numba_p(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile, DustJ, DustF, DustG, DustH, DustI, IceI, IceC, IceD, IceE, IceF, IceG, IceH)

t1 = time.time()
for i in range(1000):
    res = improved_Numba_p(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile, DustJ, DustF, DustG, DustH, DustI, IceI, IceC, IceD, IceE, IceF, IceG, IceH)

print(time.time()-t1)
Performance
Arraysize np.random.rand(100)
Numpy:            46.8 µs
naive_Numba:      3.1 µs
improved_Numba:   1.62 µs
improved_Numba_p: 17.45 µs

Arraysize np.random.rand(1000000)
Numpy:            255.8 ms
naive_Numba:      18.6 ms
improved_Numba:   6.13 ms
improved_Numba_p: 3.54 ms
Code dtype=np.float32
If np.float32 is sufficient, you have to explicitly declare all constant values in the function as float32; otherwise Numba will use float64.
@nb.njit(fastmath=True, error_model='numpy')
def improved_Numba(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
    DustJ, DustF, DustG, DustH, DustI = nb.float32(250.0), nb.float32(633.0), nb.float32(2.513), nb.float32(-2.2e-3), nb.float32(-2.8e-6)
    IceI, IceC, IceD, IceE, IceF, IceG, IceH = nb.float32(273.16), nb.float32(1.843e5), nb.float32(1.6357e8), nb.float32(3.5519e9), nb.float32(1.6670e2), nb.float32(6.4650e4), nb.float32(1.6935e6)
    res = np.empty(u.shape[0], dtype=u.dtype)

    for i in range(u.shape[0]):
        delta = u[i]-DustJ
        result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)
        x = u[i]/IceI
        result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(nb.float32(1)+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))
        res[i] = (DensityIceProfile[i]*result_ice+DensityDustProfile[i]*result_dust)/DensityProfile[i]
    return res

@nb.njit(fastmath=True, parallel=True, error_model='numpy')
def improved_Numba_p(u, PorosityProfile, DensityIceProfile, DensityDustProfile, DensityProfile):
    res = np.empty((u.shape[0]), dtype=u.dtype)
    DustJ, DustF, DustG, DustH, DustI = nb.float32(250.0), nb.float32(633.0), nb.float32(2.513), nb.float32(-2.2e-3), nb.float32(-2.8e-6)
    IceI, IceC, IceD, IceE, IceF, IceG, IceH = nb.float32(273.16), nb.float32(1.843e5), nb.float32(1.6357e8), nb.float32(3.5519e9), nb.float32(1.6670e2), nb.float32(6.4650e4), nb.float32(1.6935e6)

    for i in nb.prange(u.shape[0]):
        delta = u[i]-DustJ
        result_dust = DustF+DustG*delta+DustH*delta**2+DustI*(delta**3)
        x = u[i]/IceI
        result_ice = (x**3)*(IceC+IceD*(x**2)+IceE*(x**6))/(nb.float32(1)+IceF*(x**2)+IceG*(x**4)+IceH*(x**8))
        res[i] = (DensityIceProfile[i]*result_ice+DensityDustProfile[i]*result_dust)/DensityProfile[i]
    return res
Performance
Arraysize np.random.rand(100).astype(np.float32)
Numpy:            29.3 µs
improved_Numba:   1.33 µs
improved_Numba_p: 18 µs

Arraysize np.random.rand(1000000).astype(np.float32)
Numpy:            117 ms
improved_Numba:   2.46 ms
improved_Numba_p: 1.56 ms
The comparison to the Cython version provided by @CodeSurgeon isn't really fair, because he didn't compile the function with AVX2 and FMA3 instructions enabled. Numba compiles by default with -march=native, which enables AVX2 and FMA3 instructions on my Core i7-4xxx.
But this makes sense if you want to distribute a compiled Cython version of your code, because with those optimizations enabled it won't run by default on pre-Haswell processors (or any Pentiums and Celerons). Compiling multiple code paths should be possible, but it is compiler dependent and more work.