Why is my f2py program slower than my Python program?

Question

I recently wrote a time-consuming program in Python and decided to rewrite the most time-consuming part in Fortran.

However, the Fortran code, wrapped with f2py, is slower than the Python code. Can anyone tell me how to find out what is happening here?

For reference, here's the Python function:

import numpy as np
from numpy import linalg as LA

def iterative_method(alpha0, beta0, epsilon0, epsilons0, omega, smearing=0.01, precision=0.01, max_step=20, flag=0):
    # alpha0, beta0, epsilon0, epsilons0 are numpy arrays
    m, n = np.shape(epsilon0)
    # builtin complex: the np.complex alias is gone from recent NumPy releases
    Omega = np.eye(m, dtype=complex) * (omega + smearing * 1j)
    green = LA.inv(Omega - epsilon0) # LA is numpy.linalg
    alpha = np.dot(alpha0, np.dot(green, alpha0))
    beta = np.dot(beta0, np.dot(green, beta0))
    epsilon = epsilon0 + np.dot(alpha0, np.dot(green, beta0)) + np.dot(beta0, np.dot(green, alpha0))
    epsilons = epsilons0 + np.dot(alpha0, np.dot(green, beta0))

    # Recurse while both alpha0 and beta0 exceed precision and the step limit is not reached
    if np.max(np.abs(alpha0)) > precision and np.max(np.abs(beta0)) > precision and flag < max_step:
        flag += 1
        return iterative_method(alpha, beta, epsilon, epsilons, omega, smearing, precision, max_step, flag)
    return epsilon, epsilons, flag

The corresponding Fortran code is:

SUBROUTINE iterate(eout, esout, alpha, beta, e, es, omega, smearing, prec, max_step, rank)
    INTEGER, PARAMETER :: dp = kind(1.0d0)
    REAL(kind=dp) :: omega, smearing, prec
    INTEGER :: max_step, step, rank, cnt
    COMPLEX(kind=dp) :: alpha(rank,rank), beta(rank,rank), omega_mat(rank, rank),&
     green(rank, rank), e(rank,rank), es(rank,rank)
    COMPLEX(kind=dp), INTENT(out) :: eout(rank, rank), esout(rank, rank)
    step = 0
    omega_mat = 0
    DO cnt=1, rank
        omega_mat(cnt, cnt) = 1.0_dp
    ENDDO
    omega_mat = omega_mat * (omega + (0.0_dp, 1.0_dp) * smearing)
    DO WHILE (maxval(abs(alpha)) .gt. prec .or.  maxval(abs(beta)) .gt. prec .and. step .lt. max_step)
        green = zInverse(rank, omega_mat - e) ! zInverse is calling lapack to compute inverse of the matrix
        e = e + matmul(alpha, matmul(green, beta)) + matmul(beta, matmul(green, alpha))
        es = es + matmul(alpha, matmul(green, beta))
        alpha = matmul(alpha, matmul(green, alpha))
        beta = matmul(beta, matmul(green, beta))
        step = step + 1
    ENDDO
    print *, step
    eout = e
    esout = es
END SUBROUTINE iterate

In a test, the Python code took about 5 seconds while the Fortran code took about 7 seconds, which is hardly acceptable. Also, I can hardly see any overhead in the Fortran code. Is the wrapper to blame?

Edit: I didn't use BLAS for matmul. After switching to BLAS, the Fortran and Python versions both take around 5 seconds.

Answer 1:

First, do this on the Python code so you know exactly how it spends its time. Then you can do a similar thing on the Fortran code using a debugger, if you like.
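As a minimal sketch of one way to see where the Python side spends its time, the standard-library profiler `cProfile` can be wrapped around a single call. This is not necessarily the technique the answer originally pointed to. `profile_call` and `toy_workload` are hypothetical names introduced here, and the toy workload only stands in for the real function.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the ten entries with the largest cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def toy_workload(n):
    """Hypothetical stand-in for the real computation: sum of squares below n."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    value, report = profile_call(toy_workload, 100_000)
    print(report)
```

For the function in the question, the call would look something like `profile_call(iterative_method, alpha0, beta0, epsilon0, epsilons0, omega)`. If nearly all of the cumulative time lands in the entries for the matrix products and the inverse, the time is going to the math library rather than to Python itself.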

I suspect essentially all of the time goes into matrix operations, so any speed difference is due to the math library, not to the language that calls it. This post relays some of my experience doing that. Often the routines for things like matrix multiplication, inversion, or Cholesky decomposition are designed to be efficient on large matrices, but not on small ones.

For example, the BLAS matrix-multiplication routine DGEMM (the one LAPACK relies on) has two character arguments, TRANSA and TRANSB, which can be upper or lower case and specify whether each input matrix is transposed. To examine the value of those arguments, it calls a function LSAME. I found that if I am spending a large fraction of my time multiplying small matrices, like 4x4, the program actually spends nearly all of its time calling LSAME, and very little time actually multiplying matrices. You can see how it would be easy to fix that.
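A rough way to check whether per-call overhead matters for a given matrix size is to time the product at several sizes and divide by the amount of arithmetic. The sketch below assumes NumPy is installed. `seconds_per_multiply_add` is a hypothetical helper, and the sizes 4, 32, and 128 are arbitrary choices, not values taken from the question.

```python
import timeit

import numpy as np


def seconds_per_multiply_add(n, repeats=3, calls=50):
    """Best-of-`repeats` time for an n x n complex product, per multiply-add."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    b = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    timer = timeit.Timer(lambda: np.dot(a, b))
    # Each repeat times `calls` products; keep the fastest repeat.
    best_per_call = min(timer.repeat(repeat=repeats, number=calls)) / calls
    # An n x n product needs n**3 complex multiply-adds.
    return best_per_call / n**3


if __name__ == "__main__":
    for n in (4, 32, 128):
        cost = seconds_per_multiply_add(n)
        print(f"n={n:4d}: {cost:.3e} s per multiply-add")
```

If the cost per multiply-add is far higher at n=4 than at the larger sizes, that would be consistent with fixed per-call work (argument checks, dispatch, wrapper overhead) dominating. The actual numbers depend on the machine and on which BLAS NumPy was built against.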

Source: https://stackoverflow.com/questions/35958279/why-is-my-f2py-programs-slower-than-python-programs

Tags

python

performance

fortran

f2py