How to optimize this Fortran subroutine with many loops?

问题

I am dealing with subroutine that is very inefficient when the array size becomes large, for example, NN=1000, KK=200, MM = 200. But, I can not come up with ideas to optimize it.

program main

  implicit none

  integer :: NN, KK, MM
  integer, allocatable, dimension(:,:) :: id
  complex*16, allocatable, dimension(:) :: phase
  complex*16 :: phase_base(3)
  real*8, allocatable, dimension(:,:) :: wave_base

  complex*16, allocatable, dimension(:,:) :: wave
  integer :: i, j, k, n

  NN = 1000
  KK = 200
  MM = 200

  allocate(id(MM,3))
  allocate(phase(KK))
  allocate(wave_base(KK, NN*(NN+1)/2 ))
  allocate(wave(NN, NN))

  id(:,:) = 2

  phase_base(:) = (1.0d0,1.0d0)

  wave_base(:,:) = 1.0d0

  phase(:) = (1.0d0,1.0d0)

  call  noise_wave(NN, KK, MM, id, phase, phase_base, wave_base, wave)

  deallocate(id)
  deallocate(phase)
  deallocate(wave_base)
  deallocate(wave)

end program main

subroutine noise_wave(NN, KK, MM, id, phase_1, phase_base, wave_base, wave)
  implicit none

  integer, intent(in) :: NN, KK, MM
  integer, intent(in), dimension(MM, 3) :: id
  complex*16, intent(in) :: phase_1(KK)
  complex*16, intent(in) :: phase_base(3)
  real*8,  intent(in) :: wave_base(KK, NN*(NN+1)/2 )

  complex*16, intent(out) :: wave(NN, NN)

  integer :: i, j, k, p, n
  integer :: x, y, z
  real :: start, finish
  complex*16 :: phase_2, phase_2_conjg

  do p = 1, MM

    x = id(p, 1)
    y = id(p, 2)
    z = id(p, 3)

    phase_2 = (phase_base(1) ** x) * (phase_base(2) ** y) * (phase_base(3) ** z)

    phase_2_conjg = conjg(phase_2)

    n = 0
    do j = 1, NN
      do i = 1, j   ! upper triangle

        n = n + 1

        do k = 1, KK

          wave(i,j) = wave(i,j) + wave_base(k,n) * phase_1(k) * phase_2_conjg

        enddo

        wave(j,i) = conjg(wave(i,j) )

      enddo
    enddo
  enddo

end subroutin

Could someone give me some hint? (I have fulfill the suggested optimizations. Also, following Ian's suggestion, I have added a small test. Thus you can test it directly.)

回答1:

You might get a measurable speedup if you change your loop nest to

  do p = 1, MM

     x = id(p, 1)
     y = id(p, 2)
     z = id(p, 3)
     phase = (phase_base(1) ** x) * (phase_base(2) ** y) * (phase_base(3) ** z)
     conjg_phase = conjg(phase)  ! new variable, calculate here, use below

     n = 0
     do j = 1, NN
        do i = 1, j   
           n = n + 1
           do k = 1, KK
              wave(i,j) = wave(i,j) + wave_base(k,n) * conjg_phase
           enddo
        enddo
        wave(j,i) = conjg(wave(i,j) )
     enddo
  enddo

(and it might still be correct if I've understood the code !). Even little computations like the ones I've lifted out of the bottom of the loop nest are a drag if repeated often enough. And the execution speed might benefit from moving those values in and out of cache less often too.

It might be worth (a little) swapping the dimensions of id, then reading id(1:3,p) is likely to be more cache-friendly than the current version.

And if the execution speed is still not to your taste, time to learn OpenMP (if you don't know it already).

回答2:

Here are my solution following the nice ideas above. There is still some room for efficiency gain before OpenMP. For example, the first k loop in the subroutine can be eliminated by sum function.

program main

  implicit none

  integer :: NN, KK, MM
  integer, allocatable, dimension(:,:) :: id
  complex*16, allocatable, dimension(:) :: phase
  complex*16 :: phase_base(3)
  real*8, allocatable, dimension(:,:) :: wave_base

  complex*16, allocatable, dimension(:,:) :: wave
  integer :: i, j, k, n

  NN = 1000
  KK = 200
  MM = 200

  allocate(id(MM,3))
  allocate(phase(KK))
  allocate(wave_base(KK, NN*(NN+1)/2 ))
  allocate(wave(NN, NN))

  id(:,:) = 2

  phase_base(:) = (1.0d0,1.0d0)

  wave_base(:,:) = 1.0d0

  phase(:) = (1.0d0,1.0d0)

  call  noise_wave(NN, KK, MM, id, phase, phase_base, wave_base, wave)

  deallocate(id)
  deallocate(phase)
  deallocate(wave_base)
  deallocate(wave)

end program main

subroutine noise_wave(NN, KK, MM, id, phase_1, phase_base, wave_base, wave)
  implicit none

  integer, intent(in) :: NN, KK, MM
  integer, intent(in), dimension(MM, 3) :: id
  complex*16, intent(in) :: phase_1(KK)
  complex*16, intent(in) :: phase_base(3)
  real*8,     intent(in) :: wave_base(KK, NN*(NN+1)/2 )
  complex*16, intent(out):: wave(NN, NN)

  integer :: i, j, k, p, n
  integer :: x, y, z
  real :: start, finish
  complex*16 :: phase_2, phase_2_conjg
  complex*16 :: wave_tmp(NN*(NN+1)/2)
  complex*16 :: wave_tmp_2(NN*(NN+1)/2)

  do k = 1, KK

    wave_tmp(:) = wave_tmp(:) + wave_base(k,:) * phase_1(k)

  enddo

  do p = 1, MM
    phase_2 = product(phase_base(:)**id(p,:) )
    phase_2_conjg = conjg(phase_2)

    wave_tmp2(:) = wave_tmp2(:) + wave_tmp(n) * phase_2_conjg
  enddo

  n = 0
  do j = 1, NN
    do i = 1, j
        n = n + 1
        wave(i,j) = wave_tmp2(n)
        wave(j,i) = conjg(wave_tmp2(n) )
    enddo
  enddo

end subroutine

来源：https://stackoverflow.com/questions/61174309/how-to-optimize-this-fortran-subroutine-with-many-loops

标签

fortran

gfortran