Question
I recently wrote a time-consuming program in Python and decided to rewrite the most time-consuming part in Fortran.
However, the Fortran code, wrapped with f2py, is slower than the Python code. Can anyone tell me how to find out what is happening here?
For reference, here's the Python function:
import numpy as np
import numpy.linalg as LA

def iterative_method(alpha0, beta0, epsilon0, epsilons0, omega, smearing=0.01, precision=0.01, max_step=20, flag=0):
    # alpha0, beta0, epsilon0, epsilons0 are numpy arrays
    m, n = np.shape(epsilon0)
    Omega = np.eye(m, dtype=complex) * (omega + smearing * 1j)
    green = LA.inv(Omega - epsilon0)  # LA is numpy.linalg
    alpha = np.dot(alpha0, np.dot(green, alpha0))
    beta = np.dot(beta0, np.dot(green, beta0))
    epsilon = epsilon0 + np.dot(alpha0, np.dot(green, beta0)) + np.dot(beta0, np.dot(green, alpha0))
    epsilons = epsilons0 + np.dot(alpha0, np.dot(green, beta0))
    # The body returns immediately, so this loop runs at most once per call
    while np.max(np.abs(alpha0)) > precision and np.max(np.abs(beta0)) > precision and flag < max_step:
        flag += 1
        return iterative_method(alpha, beta, epsilon, epsilons, omega, smearing, precision, max_step, flag)
    return epsilon, epsilons, flag
The corresponding Fortran code is:
SUBROUTINE iterate(eout, esout, alpha, beta, e, es, omega, smearing, prec, max_step, rank)
  INTEGER, PARAMETER :: dp = kind(1.0d0)
  REAL(kind=dp) :: omega, smearing, prec
  INTEGER :: max_step, step, rank, cnt
  COMPLEX(kind=dp) :: alpha(rank,rank), beta(rank,rank), omega_mat(rank, rank),&
                      green(rank, rank), e(rank,rank), es(rank,rank)
  COMPLEX(kind=dp), INTENT(out) :: eout(rank, rank), esout(rank, rank)
  step = 0
  omega_mat = 0
  DO cnt=1, rank
    omega_mat(cnt, cnt) = 1.0_dp
  ENDDO
  omega_mat = omega_mat * (omega + (0.0_dp, 1.0_dp) * smearing)
  ! Note: .and. binds tighter than .or., so this condition reads as
  ! A .or. (B .and. C); the Python version joins all three tests with "and".
  DO WHILE (maxval(abs(alpha)) .gt. prec .or. maxval(abs(beta)) .gt. prec .and. step .lt. max_step)
    green = zInverse(rank, omega_mat - e) ! zInverse is calling lapack to compute inverse of the matrix
    e = e + matmul(alpha, matmul(green, beta)) + matmul(beta, matmul(green, alpha))
    es = es + matmul(alpha, matmul(green, beta))
    alpha = matmul(alpha, matmul(green, alpha))
    beta = matmul(beta, matmul(green, beta))
    step = step + 1
  ENDDO
  print *, step
  eout = e
  esout = es
END SUBROUTINE iterate
In a test, the Python code took about 5 seconds while the Fortran code took about 7 seconds, which is hardly acceptable. Also, I can hardly see any overhead in the Fortran code. Is the wrapper to blame?
Edit: I didn't use BLAS for matmul. After switching to BLAS, the Fortran and Python versions both take around 5 seconds.
Answer 1:
First, profile the Python code so you know exactly how it spends its time. Then you can do a similar thing on the Fortran code using a debugger, if you like.
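One way to see where the Python side spends its time is the standard-library profiler. The sketch below is only an illustration, not the technique the answer originally linked to: `profile_call`, `naive_matmul`, and `workload` are hypothetical names, and the pure-Python workload merely stands in for the real function.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)  # keep the 10 costliest entries
    return result, buffer.getvalue()


def naive_matmul(a, b):
    """Hypothetical stand-in workload: pure-Python matrix product of nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def workload(n=30, repeats=20):
    """Hypothetical stand-in for the function you actually want to profile."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    for _ in range(repeats):
        c = naive_matmul(a, a)
    return c[0][0]


if __name__ == "__main__":
    value, report = profile_call(workload)
    print(report)
```

To apply it to the code in the question, something like `profile_call(iterative_method, alpha0, beta0, epsilon0, epsilons0, omega)` should show how much of the runtime sits in `numpy.dot` and `numpy.linalg.inv` rather than in the Python glue.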
I suspect essentially all of the time goes into matrix operations, so any speed difference is due to the math library, not to the language that calls it. This post relays some of my experience doing that. Often the routines that do things like matrix multiplication, inversion, or Cholesky factorization are designed to be efficient on large matrices, but not on small ones.
For example, the BLAS matrix-multiplication routine DGEMM (distributed with the reference LAPACK sources) has two character arguments, TRANSA and TRANSB, which can be upper or lower case, specifying whether each input matrix is transposed. To examine the value of those arguments, it calls a function LSAME. I found that if I am spending a large fraction of my time multiplying small matrices, like 4x4, the program actually spends nearly all of its time calling LSAME, and very little time actually multiplying matrices. You can see how it would be easy to fix that.
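A rough way to check whether per-call overhead matters for your matrix sizes is to time the same library call on small and large inputs and compare the cost per multiply-add. This is a minimal sketch assuming NumPy is installed; `time_matmul` and `seconds_per_multiply_add` are hypothetical helper names, and the sizes and repeat counts are arbitrary choices.

```python
import timeit

import numpy as np


def time_matmul(n, number):
    """Return average seconds per np.dot call for two n x n complex matrices."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    b = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    total = timeit.timeit(lambda: np.dot(a, b), number=number)
    return total / number


def seconds_per_multiply_add(n, number):
    """Per-call time divided by n**3, the multiply-add count of a naive n x n product."""
    return time_matmul(n, number) / n**3


if __name__ == "__main__":
    # Columns: matrix size, seconds per call, seconds per multiply-add
    for n, number in [(4, 20000), (64, 2000), (512, 10)]:
        print(n, time_matmul(n, number), seconds_per_multiply_add(n, number))
```

If the cost per multiply-add comes out much higher for the 4x4 case than for the larger ones, fixed per-call overhead (argument checking, dispatch, wrapper code) is probably dominating, which is the situation described above. The actual numbers depend on the machine and on which BLAS NumPy is linked against.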
Source: https://stackoverflow.com/questions/35958279/why-is-my-f2py-programs-slower-than-python-programs