Question
I simply want to compute: C_i = \sum_k (A_i - B_k)^2
I saw that this calculation is faster with a simple for loop than with numpy.subtract.outer! Anyway, I feel that numpy.einsum would be the fastest, but I could not understand numpy.einsum that well. Can anyone please help me out? Additionally, it would be great if someone could explain how a general summation expression involving vectors/matrices can be written with numpy.einsum.
I did not find a solution to this particular problem on the web. Sorry if it is a duplicate in some manner.
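For context, a minimal sketch of how this particular sum can be written with np.einsum (it is still memory-heavy, since it builds the full pairwise-difference array first; the sizes below are only illustrative):

import numpy as np

a = np.random.rand(1000)
b = 10 * (np.random.rand(1000) - 0.5)

# d[i, k] = a[i] - b[k]; 'ik,ik->i' multiplies elementwise and sums over k,
# so c[i] = sum_k (a[i] - b[k])**2
d = a[:, None] - b
c = np.einsum('ik,ik->i', d, d)

# agrees with the straightforward vectorized expression
assert np.allclose(c, ((a[:, None] - b)**2).sum(axis=1))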
MWE with loop and numpy.subtract.outer
--
A) With loop
import timeit

code1 = """
import numpy as np

N = 10000
a = np.random.rand(N)
b = 10 * (np.random.rand(N) - 0.5)

def A1(x, y):
    Nx = len(x)
    z = np.zeros(Nx)
    for i in np.arange(Nx):
        z[i] = np.sum((x[i] - y) * (x[i] - y))
    return z

C1 = A1(a, b)"""

elapsed_time = timeit.timeit(code1, number=10) / 10
print("time=", elapsed_time)
B) With numpy.subtract.outer
import timeit

code1 = """
import numpy as np

N = 10000
a = np.random.rand(N)
b = 10 * (np.random.rand(N) - 0.5)

def A2(x, y):
    C = np.subtract.outer(x, y)
    return np.sum(C * C, axis=1)

C2 = A2(a, b)"""

elapsed_time = timeit.timeit(code1, number=10) / 10
print("time=", elapsed_time)
For N=10000 the loop is faster. For N=100, the outer subtract is faster. For N=10^5, the outer subtract runs into memory issues on my desktop with 8 GB of RAM!
Answer 1:
Use at least Numba, or a Fortran Implementation
Both of your functions are quite slow. Python loops are very inefficient (A1), and allocating large temporary arrays is also slow (A2 and partially also A1).
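As a side note on the einsum route asked about in the question: the square can be expanded as C_i = len(B) * A_i^2 - 2 * A_i * \sum_k B_k + \sum_k B_k^2, so the result only needs two reductions over B and no N×N temporary at all. A minimal NumPy sketch of this idea (separate from the Numba approach below):

import numpy as np

def A_expanded(x, y):
    # C_i = sum_k (x_i - y_k)**2
    #     = len(y) * x_i**2 - 2 * x_i * sum_k y_k + sum_k y_k**2
    sum_y  = np.einsum('k->', y)       # sum_k y_k
    sum_y2 = np.einsum('k,k->', y, y)  # sum_k y_k**2
    return len(y) * x * x - 2.0 * x * sum_y + sum_y2

# check against the direct definition on small inputs
a = np.random.rand(100)
b = 10 * (np.random.rand(100) - 0.5)
assert np.allclose(A_expanded(a, b), ((a[:, None] - b)**2).sum(axis=1))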
Naive Numba implementation for small arrays
import numba as nb
import numpy as np

@nb.njit(parallel=True, fastmath=True)
def A_nb_p(x, y):
    z = np.empty(x.shape[0])
    for i in nb.prange(x.shape[0]):
        TMP = 0.
        for j in range(y.shape[0]):
            TMP += (x[i] - y[j])**2
        z[i] = TMP
    return z
Timings
import time
N=int(1e5)
a=np.random.rand(N)
b=10*(np.random.rand(N)-0.5)
t1=time.time()
res_1=A1(a,b)
print(time.time()-t1)
#95.30195426940918 s
t1=time.time()
res_2=A_nb_p(a,b)
print(time.time()-t1)
#0.28573083877563477 s
#A2 is too slow to measure
As noted above, this is a naive implementation for larger arrays: Numba does not perform the calculation blockwise, which leads to a lot of cache misses and therefore poor performance. Some Fortran or C compilers should be able to apply at least the following optimization (block-wise evaluation) automatically.
Implementation for large arrays
@nb.njit(parallel=True, fastmath=True)
def A_nb_p_2(x, y):
    blk_s = 1024
    z = np.zeros(x.shape[0])
    num_blk_x = x.shape[0] // blk_s
    num_blk_y = y.shape[0] // blk_s

    # full blocks of x against full blocks of y
    for ii in nb.prange(num_blk_x):
        for jj in range(num_blk_y):
            for i in range(blk_s):
                TMP = z[ii * blk_s + i]
                for j in range(blk_s):
                    TMP += (x[ii * blk_s + i] - y[jj * blk_s + j])**2
                z[ii * blk_s + i] = TMP

    # remainder of y (beyond the last full block), for every x
    for i in nb.prange(x.shape[0]):
        TMP = z[i]
        for j in range(num_blk_y * blk_s, y.shape[0]):
            TMP += (x[i] - y[j])**2
        z[i] = TMP

    # remainder of x (beyond the last full block), against the blocked part of y
    for i in nb.prange(num_blk_x * blk_s, x.shape[0]):
        TMP = z[i]
        for j in range(num_blk_y * blk_s):
            TMP += (x[i] - y[j])**2
        z[i] = TMP

    return z
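A quick consistency check that the blocked version agrees with the naive one (a sketch; the sizes here are chosen only so that they are not multiples of 1024 and the remainder loops get exercised):

a_chk = np.random.rand(3000)
b_chk = 10 * (np.random.rand(2500) - 0.5)
assert np.allclose(A_nb_p(a_chk, b_chk), A_nb_p_2(a_chk, b_chk))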
Timings
N=int(2*1e6)
a=np.random.rand(N)
b=10*(np.random.rand(N)-0.5)
t1=time.time()
res_1=A_nb_p(a,b)
print(time.time()-t1)
#298.9394392967224
t1=time.time()
res_2=A_nb_p_2(a,b)
print(time.time()-t1)
#70.12
Source: https://stackoverflow.com/questions/58177426/outer-subtraction-with-numpy