Question
I am trying to run this code with different values of n on a Xeon Phi KNC. I get the timings shown below, but I have no idea why they fluctuate like this. Can you please guide me through it? Thanks in advance.
CODE:
program prog
  integer, allocatable :: arr1(:), arr2(:)
  integer :: i, n, time_start, time_end
  n=481
  do while (n .le. 481000000)
    allocate(arr1(n),arr2(n))
    call system_clock(time_start)
    !dir$ offload begin target(mic)
    !$omp SIMD
    do i=1,n
      arr1(i) = arr1(i) + arr2(i)
    end do
    !dir$ end offload
    call system_clock(time_end)
    write (*,*) "n=",n," time=",time_end-time_start
    deallocate(arr1,arr2)
    n = n*10
  end do
end program
RESULTS:
n= 481 time= 8881
n= 4810 time= 75
n= 48100 time= 53
n= 481000 time= 261
n= 4810000 time= 1991
n= 48100000 time= 18912
n= 481000000 time= 188203
Answer 1:
The first offload (n=481) will certainly be slow, because that is when all of the code is transferred to the KNC and the process on the card is initialised. If you don't want to see that cost, do an empty offload before you start timing things.
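A minimal sketch of that warm-up idea, assuming the same Intel offload directives used in the question. The program name `warmup_demo` and the variable `dummy` are made up for illustration, and the offload block carries one trivial assignment rather than being literally empty, in case the compiler objects to an empty block. The arrays are also given initial values here, which the original code does not do.

```fortran
program warmup_demo
  implicit none
  integer, allocatable :: arr1(:), arr2(:)
  integer :: i, n, dummy, time_start, time_end

  ! Untimed warm-up offload: the one-time cost of loading the code
  ! and initialising the card is paid here, outside the timed region.
  ! (dummy is a hypothetical placeholder variable.)
  !dir$ offload begin target(mic)
  dummy = 0
  !dir$ end offload

  n = 481
  allocate(arr1(n), arr2(n))
  arr1 = 1
  arr2 = 2

  call system_clock(time_start)
  !dir$ offload begin target(mic)
  !$omp simd
  do i = 1, n
    arr1(i) = arr1(i) + arr2(i)
  end do
  !dir$ end offload
  call system_clock(time_end)

  write (*,*) "n=", n, " time=", time_end - time_start
  deallocate(arr1, arr2)
end program warmup_demo
```

With the warm-up in place, the n=481 timing should reflect only the data transfer and the loop itself, not the initialisation.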
At the high end (n >= 481000) things seem sane: each run is ~10x slower than the one before, so the only oddity left is the scaling of the smaller sizes. It's possible that some of that is related to load imbalance. If you have a 60-core processor and are running 4 threads per core (you didn't give us this information), 4810 iterations works out to ~20 iterations per thread, which means the SIMD performance is likely to be poor, as you have 16 lanes. (Given misalignment, you may only be executing a lead-in and lead-out, and nothing at full width!)
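The arithmetic behind that estimate can be sketched as follows. This is a simplified model, not a measurement: it assumes the 4810 iterations are split evenly over 60 cores x 4 threads, that a vector holds 16 integers, and that the compiler peels scalar iterations until the data is aligned before running full-width vectors. The module and function names are hypothetical.

```fortran
module simd_chunk_math
  implicit none
contains
  ! Iterations per thread when n iterations are split evenly
  ! (integer division, so the result is rounded down).
  pure integer function iters_per_thread(n, cores, threads_per_core)
    integer, intent(in) :: n, cores, threads_per_core
    iters_per_thread = n / (cores * threads_per_core)
  end function iters_per_thread

  ! Full-width vector iterations for a chunk of len elements whose first
  ! element lies offset elements past an aligned boundary.
  pure integer function full_vectors(len, offset, lanes)
    integer, intent(in) :: len, offset, lanes
    integer :: peel
    ! Lead-in (peel) iterations needed to reach the next aligned element.
    peel = mod(lanes - mod(offset, lanes), lanes)
    if (peel >= len) then
      full_vectors = 0
    else
      full_vectors = (len - peel) / lanes
    end if
  end function full_vectors
end module simd_chunk_math

program chunk_demo
  use simd_chunk_math
  implicit none
  integer, parameter :: lanes = 16
  integer :: chunk, offset

  chunk = iters_per_thread(4810, 60, 4)
  write (*,*) "iterations per thread =", chunk
  do offset = 0, lanes - 1
    write (*,*) "offset", offset, " full vectors", &
                full_vectors(chunk, offset, lanes)
  end do
end program chunk_demo
```

Under these assumptions a 20-element chunk gets at most one full-width vector iteration, and for most starting offsets it gets none, which is the "lead-in and lead-out only" case described above.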
Source: https://stackoverflow.com/questions/51207834/unexplained-xeon-phi-overhead