I work with topographic data. For one particular problem, I have written a function in Python which uses a moving window of a particular size to zip through a matrix (grid o
Q : This problem takes 11 hours to run on a small area,... stay tuned, we can and we'll get under 20 [min] !!
given due explanations were provided, for which I thank the O/P author:
# DEM.shape = [nrows, ncols] = [ 1355, 1165 ]
# DEM.dtype = float32
# .flags = C_CONTIGUOUS : True
# F_CONTIGUOUS : False
# OWNDATA : True
# WRITEABLE : True
# ALIGNED : True
# WRITEBACKIFCOPY : False
# UPDATEIFCOPY : False
I tried to review the code and setup a mock-up of a bit more efficient code, before moving into putting all popular and ready-to-use numpy + numba
steroids in, and the interim numpy
-only result works
on a sample of [100,100]
DEM-grid for about ~ 6 [s]
at the said kernel window width w = 10
The same, for [200,200]
DEM-grid, takes under ~ 36 [s]
- obviously, the scaling is ~ O( N^2 )
The same, for [1000,1000]
DEM-grid, took under ~ 1077 [s] ~ 17.6 [min]
wow !
A field .jit
trial on [1000,1000]
DEM-grid is currently under the test and will update the post once finished + once the numba.jit()
code will enjoy to run the further accelerated results
If you @morrismc test your as-is code now, on a [100,100]
-matrix, we can already guess the achieved range of the principal speedup, even before running tests are completed.
>>> pass; import numpy as np
>>> from zmq import Stopwatch; clk = Stopwatch()
>>>
>>> size = 100; demF32 = np.random.random( ( size, size ) ).astype( np.float32 ); resF32 = demF32.copy(); clk.start(); _ = RMSH_det( demF32, 10, resF32 ); t = clk.stop(); print( "{1:>13d} [us]\nNumOf_np.nan-s was {0:d}".format( _, t ) )
6492192 [us]
NumOf_np.nan-s was 0
>>> size = 200; demF32 = np.random.random( ( size, size ) ).astype( np.float32 ); resF32 = demF32.copy(); clk.start(); _ = RMSH_det( demF32, 10, resF32 ); t = clk.stop(); print( "{1:>13d} [us]\nNumOf_np.nan-s was {0:d}".format( _, t ) )
35650629 [us]
NumOf_np.nan-s was 0
>>> size = 1000; demF32 = np.random.random( ( size, size ) ).astype( np.float32 ); resF32 = demF32.copy(); clk.start(); _ = RMSH_det( demF32, 10, resF32 ); t = clk.stop(); print( "{1:>13d} [us]\nNumOf_np.nan-s was {0:d}".format( _, t ) )
1058702889 [us]
NumOf_np.nan-s was 0
All this on scipy
1.2.1, thus without the benefits from 1.3.1 possible further speedups
numba.jit()
LLVM-compiled code. Ooops, slower ?numba.jit()
-acceleration has shown about 200 [ms]
worse runtime on [100,100]
DEM-grid, with signature having been specified ( so no ad-hoc analysis costs were accrued here ) and nogil = True
( '0.43.1+0.g8dabe7abe.dirty' not being the most recent, yet )
Guess there is nothing more to gain here, without moving the game into compiled Cython
territories, yet having about tens of minutes instead of tens of hours, the Alea Iacta Est - just the numpy
smart-vectorised code rulez!
If the original algorithm was correct (and some doubts were left in the source-code for any further improvement work), any attempt to run some other form of [PARALLEL]
code-execution-flow will not help here ( kernel-windows[w,w] are very-small and non-contiguous areas of the DEM-grid memory-layout, memory-I/O costs are dominant part of the run-time budget here, and some nicer indexing may improve the cache-line re-use, yet the overall efforts are well beyond the budget, as the target of going down from ~ 11 [hrs]
to about ~ 6 [hrs]
was more than successfully met with about ~ 20 [min]
runtimes achievable for [1300,1100]
float32 DEM-grids
The code was left as-is ( non-PEP-8 ), because of all the add-on didactic value for the [DOC.me], [TEST.me]
and [PERF.me]
phases of QA, so all kind PEP-isto-evangelisators do bear with the O/P author's view onto a full-scree-width layout left, so as to allow to understand WHY and to improve the code, which with stripped-off-comments would lose her/his way forwards in improving the code performance further on. Thx.
@jit( [ "int32( float32[:,:], int32, float32[:,:] )", ], nogil = True ) # numba.__version__ '0.43.1+0.g8dabe7abe.dirty'
def RMSH_det_jit( DEMf32, w, rmsRESULTf32 ): # pre-allocate rmsRESULTf32[:,:] externally
#import numpy as np
#from scipy import signal
#
# [nrows, ncols] = np.shape( DEM ) # avoid ~ [ 1355, 1165 ]
# # DEM.dtype = float32
# # .flags = C_CONTIGUOUS : True
# # F_CONTIGUOUS : False
# # OWNDATA : True
# # WRITEABLE : True
# # ALIGNED : True
# # WRITEBACKIFCOPY : False
# # UPDATEIFCOPY : False
#
rmsRESULTf32[:,:] = np.nan # .STO[:,:] np.nan-s, using in-place assignment into the by-ref passed, externally pre-allocated np.ndarray
dtdWIN = np.ones( ( 2 * w - 1, # .ALLOC once, re-use 1M+ times
2 * w - 1 ) )
a_div_by_nz_minus1 = 1. / ( dtdWIN.size - 1 ) # .SET float CONST with about a ~1M+ re-use
a_num_of_NaNs = 0 # .SET i4 bonus value, ret'd as a side-effect of the signature ...
# rms = DEM*np.nan # avoid ( pre-alloc rmsRESULTf32 ) externally create and pass a right-sized, empty array to store all results
# nw = ( w * 2 )**2
# x = np.arange( 0, nw )
# 11..1344
#or i in np.arange( w+1, nrows-w ): # w ~ 10 -> [11:1344, 11:1154]
for i in np.arange( w+1, DEMf32.shape[0]-w ): # ??? never touches DEM-row/column[0]?? or off-by-one indexing error ???
fromI = i - w # .UPD ALAP
tillI = i + w - 1 # .UPD ALAP upper bound index excluded ( this is how a code in [ np.arange(...)[0]:np.arange(...)[-1] ] works )
# 11..1154
#or j in np.arange( w+1, ncols-w ):
for j in np.arange( w+1, DEMf32.shape[1]-w ):
fromJ = j - w # .UPD ALAP
tillJ = j + w - 1 # .UPD ALAP upper bound index excluded ( this is how a code in [ np.arange(...)[0]:np.arange(...)[-1] ] works )
# 1..1334:21..1354 # ??? never touches first/last DEM-row/column??
# d1 = np.int64( np.arange( i-w, i+w ) ) # AVOID: 1M+ times allocated, yet never consumed, but their edge values
# d2 = np.int64( np.arange( j-w, j+w ) ) # AVOID: 1M+ times allocated, yet never consumed, but their edge values
# win = DEM[ d1[0]:d1[-1], # AVOID: while a .view-only, no need to 1M+ times instantiate a "kernel"-win(dow] ( this will create a np.view into the original DEM, not a copy ! )
# d2[0]:d2[-1] # ?.or.? NOT a .view-only, but a new .copy() instantiated, so as to call .detrend() w/o in-place modifying DEMf32 ???
# ] # ?.or.? NOT a .view-only, but a new .copy() instantiated, so as to call .detrend() w/o in-place modifying DEMf32 ???
dtdWIN[:,:] = DEMf32[fromI:tillI, fromJ:tillJ] # NOT a .view-only, but a .copy() re-populated into a just once and only once pre-allocated dtdWIN, via an in-place copy
#f np.max( np.isnan( win ) ) == 1: # AVOID: 1M+ times full-range scan, while any first np.nan decides the game and no need to scan "the rest"
if np.any( np.isnan( dtdWIN ) ): # "density" of np.nan-s determine, if this is a good idea to pre-store
a_num_of_NaNs += 1 # .INC
continue # .NOP/LOOP from here, already pre-stored np.nan-s for this case
# rms[i,j] = np.nan # DUP ( already stored in initialisation ... )
else:
#in = signal.detrend( win, type = 'linear' ) # REALLY?: in-place modification of DEM-matrix ???
dtdWIN = signal.detrend( dtdWIN, type = 'linear' ) # in scipy-v1.3.1+ can mod in-place, overwrite_data = True ) # REMOVE OLS-fit-linear trend
dtdWIN = signal.detrend( dtdWIN, type = 'constant' ) # in scipy-v1.3.1+ can mod in-place, overwrite_data = True ) # REMOVE mean
#z = np.reshape( win, -1 ) # AVOID:~1M+ re-counting constant value, known from w directly
#nz = np.size( z ) # AVOID:~1M+ re-counting constant value, known from w directly
#rootms = np.sqrt( 1 / ( nz - 1 ) * np.sum( ( z - np.mean( z ) )**2 ) )
#rms[i,j] = rootms
rmsRESULTf32[i,j] = np.sqrt( a_div_by_nz_minus1 # .STO a "scaled"
* np.dot( dtdWIN,
dtdWIN.T
).sum()
# np.sum( ( dtdWIN # SUM of
# # - dtdWIN.mean() # mean-removed ( ALREADY done via scipy.signal.detrend( 'const' ) above )
# )**2 # SQUARES
# )
) # ROOT
return( a_num_of_NaNs ) # ret i4
In some cases this is very easy to do, if all functions which you use are supported. In your code win = signal.detrend(win, type = 'linear')
is the part you have to implement in Numba, because this function isn't supported.
Implementing detrend in Numba
If you look at the source-code of detrend, and extract the relevant parts for your problem, it may look like this:
@nb.njit()
def detrend(w):
Npts=w.shape[0]
A=np.empty((Npts,2),dtype=w.dtype)
for i in range(Npts):
A[i,0]=1.*(i+1) / Npts
A[i,1]=1.
coef, resids, rank, s = np.linalg.lstsq(A, w.T)
out=w.T- np.dot(A, coef)
return out.T
I also implemented a faster solution for np.max(np.isnan(win)) == 1
@nb.njit()
def isnan(win):
for i in range(win.shape[0]):
for j in range(win.shape[1]):
if np.isnan(win[i,j]):
return True
return False
Main function
As I used Numba here, the parallelization is very simple, just a prange on the outer loop and
import numpy as np
import numba as nb
@nb.njit(parallel=True)
def RMSH_det_nb(DEM, w):
[nrows, ncols] = np.shape(DEM)
#create an empty array to store result
rms = DEM*np.nan
for i in nb.prange(w+1,nrows-w):
for j in range(w+1,ncols-w):
win = DEM[i-w:i+w-1,j-w:j+w-1]
if isnan(win):
rms[i,j] = np.nan
else:
win = detrend(win)
z = win.flatten()
nz = z.size
rootms = np.sqrt(1 / (nz - 1) * np.sum((z-np.mean(z))**2))
rms[i,j] = rootms
return rms
Timings (small example)
w = 10
DEM=np.random.rand(100, 100).astype(np.float32)
res1=RMSH_det(DEM, w)
res2=RMSH_det_nb(DEM, w)
print(np.allclose(res1,res2,equal_nan=True))
#True
%timeit res1=RMSH_det(DEM, w)
#1.59 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=RMSH_det_nb(DEM, w) #approx. 55 times faster
#29 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Timings for larger arrays
w = 10
DEM=np.random.rand(1355, 1165).astype(np.float32)
%timeit res2=RMSH_det_nb(DEM, w)
#6.63 s ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[Edit] Implemenation using normal Equations
Overdetermined system
This method has a lower numerical precision. Although this solution is quite a lot faster.
@nb.njit()
def isnan(win):
for i in range(win.shape[0]):
for j in range(win.shape[1]):
if win[i,j]==np.nan:
return True
return False
@nb.njit()
def detrend(w):
Npts=w.shape[0]
A=np.empty((Npts,2),dtype=w.dtype)
for i in range(Npts):
A[i,0]=1.*(i+1) / Npts
A[i,1]=1.
coef, resids, rank, s = np.linalg.lstsq(A, w.T)
out=w.T- np.dot(A, coef)
return out.T
@nb.njit()
def detrend_2(w,T1,A):
T2=np.dot(A.T,w.T)
coef=np.linalg.solve(T1,T2)
out=w.T- np.dot(A, coef)
return out.T
@nb.njit(parallel=True)
def RMSH_det_nb_normal_eq(DEM,w):
[nrows, ncols] = np.shape(DEM)
#create an empty array to store result
rms = DEM*np.nan
Npts=w*2-1
A=np.empty((Npts,2),dtype=DEM.dtype)
for i in range(Npts):
A[i,0]=1.*(i+1) / Npts
A[i,1]=1.
T1=np.dot(A.T,A)
nz = Npts**2
for i in nb.prange(w+1,nrows-w):
for j in range(w+1,ncols-w):
win = DEM[i-w:i+w-1,j-w:j+w-1]
if isnan(win):
rms[i,j] = np.nan
else:
win = detrend_2(win,T1,A)
rootms = np.sqrt(1 / (nz - 1) * np.sum((win-np.mean(win))**2))
rms[i,j] = rootms
return rms
Timings
w = 10
DEM=np.random.rand(100, 100).astype(np.float32)
res1=RMSH_det(DEM, w)
res2=RMSH_det_nb(DEM, w)
print(np.allclose(res1,res2,equal_nan=True))
#True
%timeit res1=RMSH_det(DEM, w)
#1.59 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=RMSH_det_nb_normal_eq(DEM,w)
#7.97 ms ± 89.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Optimized solution using normal equations
Temporary arrays are reused to avoid costly memory allocations and a custom implementation for matrix multiplication is used. This is only recommendable for very small matrices, in most other cases np.dot (sgeemm) will be a lot faster.
@nb.njit()
def matmult_2(A,B,out):
for j in range(B.shape[1]):
acc1=nb.float32(0)
acc2=nb.float32(0)
for k in range(B.shape[0]):
acc1+=A[0,k]*B[k,j]
acc2+=A[1,k]*B[k,j]
out[0,j]=acc1
out[1,j]=acc2
return out
@nb.njit(fastmath=True)
def matmult_mod(A,B,w,out):
for j in range(B.shape[1]):
for i in range(A.shape[0]):
acc=nb.float32(0)
acc+=A[i,0]*B[0,j]+A[i,1]*B[1,j]
out[j,i]=acc-w[j,i]
return out
@nb.njit()
def detrend_2_opt(w,T1,A,Tempvar_1,Tempvar_2):
T2=matmult_2(A.T,w.T,Tempvar_1)
coef=np.linalg.solve(T1,T2)
return matmult_mod(A, coef,w,Tempvar_2)
@nb.njit(parallel=True)
def RMSH_det_nb_normal_eq_opt(DEM,w):
[nrows, ncols] = np.shape(DEM)
#create an empty array to store result
rms = DEM*np.nan
Npts=w*2-1
A=np.empty((Npts,2),dtype=DEM.dtype)
for i in range(Npts):
A[i,0]=1.*(i+1) / Npts
A[i,1]=1.
T1=np.dot(A.T,A)
nz = Npts**2
for i in nb.prange(w+1,nrows-w):
Tempvar_1=np.empty((2,Npts),dtype=DEM.dtype)
Tempvar_2=np.empty((Npts,Npts),dtype=DEM.dtype)
for j in range(w+1,ncols-w):
win = DEM[i-w:i+w-1,j-w:j+w-1]
if isnan(win):
rms[i,j] = np.nan
else:
win = detrend_2_opt(win,T1,A,Tempvar_1,Tempvar_2)
rootms = np.sqrt(1 / (nz - 1) * np.sum((win-np.mean(win))**2))
rms[i,j] = rootms
return rms
Timings
w = 10
DEM=np.random.rand(100, 100).astype(np.float32)
res1=RMSH_det(DEM, w)
res2=RMSH_det_nb_normal_eq_opt(DEM, w)
print(np.allclose(res1,res2,equal_nan=True))
#True
%timeit res1=RMSH_det(DEM, w)
#1.59 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=RMSH_det_nb_normal_eq_opt(DEM,w)
#4.66 ms ± 87.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Timings for isnan
This function is a completely other implementation. It is much faster if a NaN is quite at the beginning of the array, but anyway even if not there is some speedup. I benchmarked it with small arrays (approx. window size) and a large size suggested by @user3666197.
case_1=np.full((20,20),np.nan)
case_2=np.full((20,20),0.)
case_2[10,10]=np.nan
case_3=np.full((20,20),0.)
case_4 = np.full( ( int( 1E4 ), int( 1E4 ) ),np.nan)
case_5 = np.ones( ( int( 1E4 ), int( 1E4 ) ) )
%timeit np.any(np.isnan(case_1))
%timeit np.any(np.isnan(case_2))
%timeit np.any(np.isnan(case_3))
%timeit np.any(np.isnan(case_4))
%timeit np.any(np.isnan(case_5))
#2.75 µs ± 73.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#2.75 µs ± 46.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#2.76 µs ± 32.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#81.3 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#86.7 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit isnan(case_1)
%timeit isnan(case_2)
%timeit isnan(case_3)
%timeit isnan(case_4)
%timeit isnan(case_5)
#244 ns ± 5.02 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#357 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#475 ns ± 9.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#235 ns ± 0.933 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#58.8 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)