I have written a function in Python to calculate the delta function with Gaussian broadening, which involves 4 nested loops. However, the efficiency is very low, about 10 times slower than the equivalent Fortran code.
To get the best performance I recommend Numba (easy to use, good performance). Alternatively, Cython may be a good option, but it requires a few more changes to your code.
You actually got everything right and implemented a solution that is easy to understand, both for a human and, most importantly, for a compiler.
There are basically two ways to gain performance:
Vectorize the code, as @scnerd showed. This is usually a bit slower and more complex than simply compiling straightforward code that only uses a few for loops. Don't vectorize your code and then use a compiler: compared with the simple looping approach, that is usually more work and leads to a slower, more complex result. The advantage of vectorizing is that you only need NumPy, which is a standard dependency in nearly every Python project that does numerical calculations.
Compile the code. If you already have a solution with a few loops and no (or only a few) non-NumPy functions involved, this is often the simplest and fastest approach.
A solution using Numba
You do not have to change much. I changed the pow function to np.power and made some slight changes to the way the arrays are accessed in NumPy (this isn't strictly necessary).
import numba as nb
import numpy as np

# performance-debug info: print LLVM's loop-vectorization diagnostics
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit(fastmath=True)
def Delta_Gaussf_nb(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=np.float64)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    if j1 >= i1:
                        Delta_Gauss[w1, k1, i1, j1] = np.exp(-0.5 * np.power((eigv[k1, j1] - eigv[k1, i1] - hw[w1]) / width, 2)) / (np.sqrt(2.0 * np.pi) * width)
    return Delta_Gauss
Due to the 'if', the SIMD vectorization fails. In the next step we can remove it (a call to np.triu(Delta_Gauss) outside the njit-ed function may then be necessary; see the sketch after the next code block). I also parallelized the function.
@nb.njit(fastmath=True, parallel=True)
def Delta_Gaussf_1(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=np.float64)
    for w1 in nb.prange(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    Delta_Gauss[w1, k1, i1, j1] = np.exp(-0.5 * np.power((eigv[k1, j1] - eigv[k1, i1] - hw[w1]) / width, 2)) / (np.sqrt(2.0 * np.pi) * width)
    return Delta_Gauss
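If you need exactly the upper-triangular result of the original code, a hypothetical wrapper (the name Delta_Gaussf_triu is mine) can apply np.triu outside the compiled function:

def Delta_Gaussf_triu(Nw, N_bd, N_kp, hw, width, eigv):
    # np.triu zeroes the lower triangle of the last two axes,
    # reproducing the effect of the removed 'if j1 >= i1' check.
    return np.triu(Delta_Gaussf_1(Nw, N_bd, N_kp, hw, width, eigv))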
Performance
Nw = 20
N_bd = 20
N_kp = 20
width = 20
hw = np.linspace(0., 1.0, Nw)
eigv = np.zeros((N_kp, N_bd), dtype=np.float64)
Your version: 0.5s
first_compiled version: 1.37ms
parallel version: 0.55ms
These easy optimizations lead to a speedup of about 1000x.
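For reference, a minimal timing sketch along these lines (a hypothetical harness, not the exact one used for the numbers above; the warm-up calls keep the one-time compilation cost out of the measurement):

import time

# Warm up so that compilation time is not included in the measurement
Delta_Gaussf_nb(Nw, N_bd, N_kp, hw, width, eigv)
Delta_Gaussf_1(Nw, N_bd, N_kp, hw, width, eigv)

for name, f in [('compiled', Delta_Gaussf_nb), ('parallel', Delta_Gaussf_1)]:
    t0 = time.perf_counter()
    for _ in range(100):
        f(Nw, N_bd, N_kp, hw, width, eigv)
    print(name, (time.perf_counter() - t0) / 100, 's per call')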
BLUF: Using NumPy's full functionality, plus another neat module, you can get the Python code running over 100x faster than this raw for-loop code. Using @max9111's answer, however, you can go even faster with much cleaner code and less work.
The resulting code looks nothing like the original, so I'll do the optimization one step at a time so that the process and the final code make sense. Essentially, we're going to use a lot of broadcasting in order to get NumPy to perform the looping under the hood (which is always faster than looping in Python). The result computes the full square of results, which means we're necessarily duplicating some work since the result is symmetrical, but it's easier, and honestly probably faster, to do this work in high-performance ways than to have an if at the deepest level of looping in order to avoid the computation. This might be avoidable in Fortran, but probably not in Python. If you want the result to be identical to your provided source, we'll need to take the upper triangle of the result of my code below (which I do in the sample code below... feel free to remove the triu call in actual production, it's not necessary).
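As a side note, np.triu applied to an array with more than two dimensions zeroes the lower triangle of the last two axes, which is exactly what we want for the (Nw, N_kp, N_bd, N_bd) result. A tiny demonstration on toy data:

import numpy as np

a = np.arange(18, dtype=float).reshape(2, 3, 3)
print(np.triu(a)[0])
# [[0. 1. 2.]
#  [0. 4. 5.]
#  [0. 0. 8.]]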
First, we'll notice a few things. The main equation has a denominator that performs np.sqrt, but the content of that computation doesn't change at any iteration of the loop, so we'll compute it once and re-use the result. This turns out to be minor, but we'll do it anyway. Next, the main function of the inner two loops is to perform eigv[k1][j1] - eigv[k1][i1], which is quite easy to vectorize. If eigv is a matrix, then eigv[k1] - eigv[k1].T produces a matrix where result[i1, j1] = eigv[k1][j1] - eigv[k1][i1]. This allows us to entirely remove the innermost two loops:
def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
    denom = np.sqrt(2.0 * np.pi) * width
    eigv = np.matrix(eigv)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            this_eigv = (eigv[k1] - eigv[k1].T - hw[w1])
            v = np.power(this_eigv / width, 2)
            Delta_Gauss[w1, k1, :, :] = np.exp(-0.5 * v) / denom
    # Take the upper triangle to make the result exactly equal to the original code
    return np.triu(Delta_Gauss)
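If you would rather avoid np.matrix (which NumPy now discourages), the same pairwise subtraction can be written with explicit axes on a plain array; a sketch (the function name is mine):

def mine_Delta_Gaussf_no_matrix(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
    denom = np.sqrt(2.0 * np.pi) * width
    for w1 in range(Nw):
        for k1 in range(N_kp):
            row = eigv[k1]
            # row[None, :] - row[:, None] gives an (N_bd, N_bd) matrix whose
            # entry [i1, j1] equals eigv[k1, j1] - eigv[k1, i1]
            this_eigv = row[None, :] - row[:, None] - hw[w1]
            v = np.power(this_eigv / width, 2)
            Delta_Gauss[w1, k1, :, :] = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)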
Well, now that we're on the broadcasting bandwagon, it really seems like the remaining two loops should be possible to remove in the same way. As it happens, it is easy! The only thing we need k1 for is to get the row out of eigv that we're trying to pairwise-subtract... so why not do this to all rows at the same time? We're currently basically subtracting matrices of shapes (1, B) - (B, 1) for each of N rows in eigv (where B is N_bd). We can abuse broadcasting to do this for all rows of eigv simultaneously by subtracting matrices of shapes (N, 1, B) - (N, B, 1) (where N is N_kp):
def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
    denom = np.sqrt(2.0 * np.pi) * width
    for w1 in range(Nw):
        this_eigv = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2) - hw[w1]
        v = np.power(this_eigv / width, 2)
        Delta_Gauss[w1, :, :, :] = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
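A quick shape check of that broadcast on toy values (eigv_toy is just an illustration):

import numpy as np

eigv_toy = np.arange(6.0).reshape(2, 3)   # (N_kp=2, N_bd=3)
diff = np.expand_dims(eigv_toy, 1) - np.expand_dims(eigv_toy, 2)
print(diff.shape)                         # (2, 3, 3)
# diff[k1, i1, j1] == eigv_toy[k1, j1] - eigv_toy[k1, i1], as required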
The next step should be clear now. We're only using w1 to index hw, so let's do some more broadcasting to make numpy do the looping instead. We're currently subtracting a scalar value from a matrix of shape (N, B, B), so to get the resulting matrix for each of the W values in hw, we need to perform subtraction on matrices of the shapes (1, N, B, B) - (W, 1, 1, 1), and numpy will broadcast everything to produce a matrix of the shape (W, N, B, B):
def Delta_Gaussf(hw, width, eigv):
    eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
    w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
    v = np.power(w_sub / width, 2)
    denom = np.sqrt(2.0 * np.pi) * width
    Delta_Gauss = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
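A hypothetical call with the same toy sizes as the timings below, just to show the resulting shape:

Nw, N_kp, N_bd = 20, 20, 20
hw = np.linspace(0.0, 1.0, Nw)
width = 20.0
eigv = np.random.rand(N_kp, N_bd)
out = Delta_Gaussf(hw, width, eigv)
print(out.shape)   # (20, 20, 20, 20), i.e. (Nw, N_kp, N_bd, N_bd)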
On my example data, this code is ~100x faster (~900ms to ~10ms). Your mileage might vary.
But wait! There's more! Since our code is all numeric/numpy/python, we can use another handy module called numba to compile this function into an equivalent one with higher performance. Under the hood, it's basically reading what functions we're calling and converting the function into C types and C calls to remove the Python function call overhead. It's doing more than that, but that gives the gist of where we're going to gain benefit. Gaining this benefit is trivial in this case:
import numba

@numba.jit
def Delta_Gaussf(hw, width, eigv):
    eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
    w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
    v = np.power(w_sub / width, 2)
    denom = np.sqrt(2.0 * np.pi) * width
    Delta_Gauss = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
The resulting function runs in ~7 ms on my sample data, down from ~10 ms, just by adding that decorator. Pretty nice for no effort.
EDIT: @max9111 gave a better answer that points out that numba works much better with the loop syntax than with numpy broadcasting code. With almost no work besides the removal of the inner if statement, he shows that numba.jit can make the almost-original code even faster. The result is much cleaner, in that you still have just the single innermost equation that shows what each value is, and you don't have to follow the magical broadcasting used above. I highly recommend using his answer.
Conclusion
For my given sample data (Nw = 20, N_bd = 20, N_kp = 20), my final runtimes are the following (I've included timings on the same computer for @max9111's solution, first without using parallel execution and then with it on my 2-core VM):
Original code: ~900 ms
Fortran estimate: ~90 ms (based on OP saying it was ~10x faster)
Final numpy code: ~10 ms
Final code with numba.jit: ~7 ms
max9111's solution (serial): ~4 ms
max9111's solution (parallel, 2 cores): ~3 ms
Overall vectorized speedup: ~130x
max9111's numba speedup: ~300x (potentially more with more cores)
I don't know exactly how fast your Fortran code is, but it looks like proper usage of numpy lets you easily beat it by an order of magnitude, and @max9111's numba solution gives you potentially another order of magnitude.