I have written a function in Python to calculate the delta function with Gaussian broadening, which involves 4 nested loops. However, the efficiency is very low, about 10 times slower than the equivalent Fortran code.
To get the best performance I recommend Numba (easy to use, good performance). Alternatively, Cython may be a good option, but it requires a few more changes to your code.
You actually got everything right and implemented a solution that is easy to understand, both for a human and, most importantly, for a compiler.
There are basically two ways to gain performance:
Vectorize the code, as @scnerd showed. This is usually a bit slower and more complex than simply compiling straightforward code that only uses a few for loops. Don't vectorize your code and then use a compiler: compared with the simple looping approach, that is usually more work and leads to a slower, more complex result. The advantage of vectorizing is that you only need NumPy, which is a standard dependency in nearly every Python project that does numerical calculations.
Compile the code. If you already have a solution with a few loops and no (or only a few) non-NumPy functions involved, this is often the simplest and fastest approach.
A solution using Numba
You do not have to change much. I changed the pow function to np.power and made some slight changes to the way the arrays are accessed in NumPy (this isn't strictly necessary).
import numba as nb
import numpy as np

# performance-debug info: print LLVM's loop-vectorization diagnostics
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit(fastmath=True)
def Delta_Gaussf_nb(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=np.float64)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    if j1 >= i1:
                        Delta_Gauss[w1, k1, i1, j1] = np.exp(-0.5 * np.power((eigv[k1, j1] - eigv[k1, i1] - hw[w1]) / width, 2)) / (np.sqrt(2.0 * np.pi) * width)
    return Delta_Gauss
Due to the 'if', the SIMD vectorization fails. In the next step we can remove it (a call to np.triu(Delta_Gauss) outside the njit-ed function may then be necessary; see the sketch after the next code block). I also parallelized the function.
@nb.njit(fastmath=True, parallel=True)
def Delta_Gaussf_1(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=np.float64)
    for w1 in nb.prange(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    Delta_Gauss[w1, k1, i1, j1] = np.exp(-0.5 * np.power((eigv[k1, j1] - eigv[k1, i1] - hw[w1]) / width, 2)) / (np.sqrt(2.0 * np.pi) * width)
    return Delta_Gauss
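If you need exactly the upper-triangular result of the original code, a hypothetical wrapper (the name Delta_Gaussf_triu is mine) can apply np.triu outside the compiled function:

def Delta_Gaussf_triu(Nw, N_bd, N_kp, hw, width, eigv):
    # np.triu zeroes the lower triangle of the last two axes,
    # reproducing the effect of the removed 'if j1 >= i1' check.
    return np.triu(Delta_Gaussf_1(Nw, N_bd, N_kp, hw, width, eigv))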
Performance
Nw = 20
N_bd = 20
N_kp = 20
width = 20
hw = np.linspace(0., 1.0, Nw)
eigv = np.zeros((N_kp, N_bd), dtype=np.float64)
Your version: 0.5s
first_compiled version: 1.37ms
parallel version: 0.55ms
These easy optimizations lead to a speedup of about 1000x.
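For reference, a minimal timing sketch along these lines (a hypothetical harness, not the exact one used for the numbers above; the warm-up calls keep the one-time compilation cost out of the measurement):

import time

# Warm up so that compilation time is not included in the measurement
Delta_Gaussf_nb(Nw, N_bd, N_kp, hw, width, eigv)
Delta_Gaussf_1(Nw, N_bd, N_kp, hw, width, eigv)

for name, f in [('compiled', Delta_Gaussf_nb), ('parallel', Delta_Gaussf_1)]:
    t0 = time.perf_counter()
    for _ in range(100):
        f(Nw, N_bd, N_kp, hw, width, eigv)
    print(name, (time.perf_counter() - t0) / 100, 's per call')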
BLUF: Using NumPy's full functionality, plus another neat module, you can get the Python code running over 100x faster than this raw for-loop code. Using @max9111's answer, however, you can go even faster with much cleaner code and less work.
The resulting code looks nothing like the original, so I'll do the optimization one step at a time so that the process and the final code make sense. Essentially, we're going to use a lot of broadcasting in order to get NumPy to perform the looping under the hood (which is always faster than looping in Python). The result computes the full square of results, which means we're necessarily duplicating some work since the result is symmetrical, but it's easier, and honestly probably faster, to do this work in high-performance ways than to have an if at the deepest level of looping in order to avoid the computation. This might be avoidable in Fortran, but probably not in Python. If you want the result to be identical to your provided source, we'll need to take the upper triangle of the result of my code below (which I do in the sample code below... feel free to remove the triu call in actual production, it's not necessary).
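As a side note, np.triu applied to an array with more than two dimensions zeroes the lower triangle of the last two axes, which is exactly what we want for the (Nw, N_kp, N_bd, N_bd) result. A tiny demonstration on toy data:

import numpy as np

a = np.arange(18, dtype=float).reshape(2, 3, 3)
print(np.triu(a)[0])
# [[0. 1. 2.]
#  [0. 4. 5.]
#  [0. 0. 8.]]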
First, we'll notice a few things. The main equation has a denominator that performs np.sqrt, but the content of that computation doesn't change at any iteration of the loop, so we'll compute it once and re-use the result. This turns out to be minor, but we'll do it anyway. Next, the main function of the inner two loops is to perform eigv[k1][j1] - eigv[k1][i1], which is quite easy to vectorize. If eigv is a matrix, then eigv[k1] - eigv[k1].T produces a matrix where result[i1, j1] = eigv[k1][j1] - eigv[k1][i1]. This allows us to entirely remove the innermost two loops:
def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
    denom = np.sqrt(2.0 * np.pi) * width
    eigv = np.matrix(eigv)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            this_eigv = (eigv[k1] - eigv[k1].T - hw[w1])
            v = np.power(this_eigv / width, 2)
            Delta_Gauss[w1, k1, :, :] = np.exp(-0.5 * v) / denom
    # Take the upper triangle to make the result exactly equal to the original code
    return np.triu(Delta_Gauss)
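If you would rather avoid np.matrix (which NumPy now discourages), the same pairwise subtraction can be written with explicit axes on a plain array; a sketch (the function name is mine):

def mine_Delta_Gaussf_no_matrix(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
    denom = np.sqrt(2.0 * np.pi) * width
    for w1 in range(Nw):
        for k1 in range(N_kp):
            row = eigv[k1]
            # row[None, :] - row[:, None] gives an (N_bd, N_bd) matrix whose
            # entry [i1, j1] equals eigv[k1, j1] - eigv[k1, i1]
            this_eigv = row[None, :] - row[:, None] - hw[w1]
            v = np.power(this_eigv / width, 2)
            Delta_Gauss[w1, k1, :, :] = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)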
Well, now that we're on the broadcasting bandwagon, it really seems like the remaining two loops should be possible to remove in the same way. As it happens, it is easy! The only thing we need k1 for is to get the row out of eigv that we're trying to pairwise-subtract... so why not do this to all rows at the same time? We're currently basically subtracting matrices of shapes (1, B) - (B, 1) for each of N rows in eigv (where B is N_bd). We can abuse broadcasting to do this for all rows of eigv simultaneously by subtracting matrices of shapes (N, 1, B) - (N, B, 1) (where N is N_kp):
def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
    denom = np.sqrt(2.0 * np.pi) * width
    for w1 in range(Nw):
        this_eigv = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2) - hw[w1]
        v = np.power(this_eigv / width, 2)
        Delta_Gauss[w1, :, :, :] = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
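A quick shape check of that broadcast on toy values (eigv_toy is just an illustration):

import numpy as np

eigv_toy = np.arange(6.0).reshape(2, 3)   # (N_kp=2, N_bd=3)
diff = np.expand_dims(eigv_toy, 1) - np.expand_dims(eigv_toy, 2)
print(diff.shape)                         # (2, 3, 3)
# diff[k1, i1, j1] == eigv_toy[k1, j1] - eigv_toy[k1, i1], as required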
The next step should be clear now. We're only using w1 to index hw, so let's do some more broadcasting to make numpy do the looping instead. We're currently subtracting a scalar value from a matrix of shape (N, B, B), so to get the resulting matrix for each of the W values in hw, we need to perform subtraction on matrices of the shapes (1, N, B, B) - (W, 1, 1, 1), and numpy will broadcast everything to produce a matrix of the shape (W, N, B, B):
def Delta_Gaussf(hw, width, eigv):
    eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
    w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
    v = np.power(w_sub / width, 2)
    denom = np.sqrt(2.0 * np.pi) * width
    Delta_Gauss = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
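A hypothetical call with the same toy sizes as the timings below, just to show the resulting shape:

Nw, N_kp, N_bd = 20, 20, 20
hw = np.linspace(0.0, 1.0, Nw)
width = 20.0
eigv = np.random.rand(N_kp, N_bd)
out = Delta_Gaussf(hw, width, eigv)
print(out.shape)   # (20, 20, 20, 20), i.e. (Nw, N_kp, N_bd, N_bd)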
On my example data, this code is ~100x faster (~900ms to ~10ms). Your mileage might vary.
But wait! There's more! Since our code is all numeric/numpy/python, we can use another handy module called numba to compile this function into an equivalent one with higher performance. Under the hood, it's basically reading what functions we're calling and converting the function into C types and C calls to remove the Python function call overhead. It's doing more than that, but that gives the gist of where we're going to gain benefit. Gaining this benefit is trivial in this case:
import numba

@numba.jit
def Delta_Gaussf(hw, width, eigv):
    eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
    w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
    v = np.power(w_sub / width, 2)
    denom = np.sqrt(2.0 * np.pi) * width
    Delta_Gauss = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
The resulting function runs in ~7 ms on my sample data, down from ~10 ms, just by adding that decorator. Pretty nice for no effort.
EDIT: @max9111 gave a better answer that points out that numba works much better with the loop syntax than with numpy broadcasting code. With almost no work besides the removal of the inner if statement, he shows that numba.jit can make the almost-original code even faster. The result is much cleaner, in that you still have just the single innermost equation that shows what each value is, and you don't have to follow the magical broadcasting used above. I highly recommend using his answer.
Conclusion
For my given sample data (Nw = 20, N_bd = 20, N_kp = 20), my final runtimes are the following (I've included timings on the same computer for @max9111's solution, first without using parallel execution and then with it on my 2-core VM):
Original code: ~900 ms
Fortran estimate: ~90 ms (based on OP saying it was ~10x faster)
Final numpy code: ~10 ms
Final code with numba.jit: ~7 ms
max9111's solution (serial): ~4 ms
max9111's solution (parallel, 2 cores): ~3 ms
Overall vectorized speedup: ~130x
max9111's numba speedup: ~300x (potentially more with more cores)
I don't know exactly how fast your Fortran code is, but it looks like proper usage of numpy lets you easily beat it by an order of magnitude, and @max9111's numba solution gives you potentially another order of magnitude.