问题
I am dealing with subroutine that is very inefficient when the array size becomes large, for example, NN=1000, KK=200, MM = 200. But, I can not come up with ideas to optimize it.
program main
implicit none
integer :: NN, KK, MM
integer, allocatable, dimension(:,:) :: id
complex*16, allocatable, dimension(:) :: phase
complex*16 :: phase_base(3)
real*8, allocatable, dimension(:,:) :: wave_base
complex*16, allocatable, dimension(:,:) :: wave
integer :: i, j, k, n
NN = 1000
KK = 200
MM = 200
allocate(id(MM,3))
allocate(phase(KK))
allocate(wave_base(KK, NN*(NN+1)/2 ))
allocate(wave(NN, NN))
id(:,:) = 2
phase_base(:) = (1.0d0,1.0d0)
wave_base(:,:) = 1.0d0
phase(:) = (1.0d0,1.0d0)
call noise_wave(NN, KK, MM, id, phase, phase_base, wave_base, wave)
deallocate(id)
deallocate(phase)
deallocate(wave_base)
deallocate(wave)
end program main
subroutine noise_wave(NN, KK, MM, id, phase_1, phase_base, wave_base, wave)
implicit none
integer, intent(in) :: NN, KK, MM
integer, intent(in), dimension(MM, 3) :: id
complex*16, intent(in) :: phase_1(KK)
complex*16, intent(in) :: phase_base(3)
real*8, intent(in) :: wave_base(KK, NN*(NN+1)/2 )
complex*16, intent(out) :: wave(NN, NN)
integer :: i, j, k, p, n
integer :: x, y, z
real :: start, finish
complex*16 :: phase_2, phase_2_conjg
do p = 1, MM
x = id(p, 1)
y = id(p, 2)
z = id(p, 3)
phase_2 = (phase_base(1) ** x) * (phase_base(2) ** y) * (phase_base(3) ** z)
phase_2_conjg = conjg(phase_2)
n = 0
do j = 1, NN
do i = 1, j ! upper triangle
n = n + 1
do k = 1, KK
wave(i,j) = wave(i,j) + wave_base(k,n) * phase_1(k) * phase_2_conjg
enddo
wave(j,i) = conjg(wave(i,j) )
enddo
enddo
enddo
end subroutin
Could someone give me some hint? (I have fulfill the suggested optimizations. Also, following Ian's suggestion, I have added a small test. Thus you can test it directly.)
回答1:
You might get a measurable speedup if you change your loop nest to
do p = 1, MM
x = id(p, 1)
y = id(p, 2)
z = id(p, 3)
phase = (phase_base(1) ** x) * (phase_base(2) ** y) * (phase_base(3) ** z)
conjg_phase = conjg(phase) ! new variable, calculate here, use below
n = 0
do j = 1, NN
do i = 1, j
n = n + 1
do k = 1, KK
wave(i,j) = wave(i,j) + wave_base(k,n) * conjg_phase
enddo
enddo
wave(j,i) = conjg(wave(i,j) )
enddo
enddo
(and it might still be correct if I've understood the code !). Even little computations like the ones I've lifted out of the bottom of the loop nest are a drag if repeated often enough. And the execution speed might benefit from moving those values in and out of cache less often too.
It might be worth (a little) swapping the dimensions of id
, then reading id(1:3,p)
is likely to be more cache-friendly than the current version.
And if the execution speed is still not to your taste, time to learn OpenMP (if you don't know it already).
回答2:
Here are my solution following the nice ideas above. There is still some room for efficiency gain before OpenMP. For example, the first k loop in the subroutine can be eliminated by sum function.
program main
implicit none
integer :: NN, KK, MM
integer, allocatable, dimension(:,:) :: id
complex*16, allocatable, dimension(:) :: phase
complex*16 :: phase_base(3)
real*8, allocatable, dimension(:,:) :: wave_base
complex*16, allocatable, dimension(:,:) :: wave
integer :: i, j, k, n
NN = 1000
KK = 200
MM = 200
allocate(id(MM,3))
allocate(phase(KK))
allocate(wave_base(KK, NN*(NN+1)/2 ))
allocate(wave(NN, NN))
id(:,:) = 2
phase_base(:) = (1.0d0,1.0d0)
wave_base(:,:) = 1.0d0
phase(:) = (1.0d0,1.0d0)
call noise_wave(NN, KK, MM, id, phase, phase_base, wave_base, wave)
deallocate(id)
deallocate(phase)
deallocate(wave_base)
deallocate(wave)
end program main
subroutine noise_wave(NN, KK, MM, id, phase_1, phase_base, wave_base, wave)
implicit none
integer, intent(in) :: NN, KK, MM
integer, intent(in), dimension(MM, 3) :: id
complex*16, intent(in) :: phase_1(KK)
complex*16, intent(in) :: phase_base(3)
real*8, intent(in) :: wave_base(KK, NN*(NN+1)/2 )
complex*16, intent(out):: wave(NN, NN)
integer :: i, j, k, p, n
integer :: x, y, z
real :: start, finish
complex*16 :: phase_2, phase_2_conjg
complex*16 :: wave_tmp(NN*(NN+1)/2)
complex*16 :: wave_tmp_2(NN*(NN+1)/2)
do k = 1, KK
wave_tmp(:) = wave_tmp(:) + wave_base(k,:) * phase_1(k)
enddo
do p = 1, MM
phase_2 = product(phase_base(:)**id(p,:) )
phase_2_conjg = conjg(phase_2)
wave_tmp2(:) = wave_tmp2(:) + wave_tmp(n) * phase_2_conjg
enddo
n = 0
do j = 1, NN
do i = 1, j
n = n + 1
wave(i,j) = wave_tmp2(n)
wave(j,i) = conjg(wave_tmp2(n) )
enddo
enddo
end subroutine
来源:https://stackoverflow.com/questions/61174309/how-to-optimize-this-fortran-subroutine-with-many-loops