Unexplained Xeon-Phi Overhead

Question

I am trying to run this code with different values of n on a Xeon Phi KNC. I get the timings shown in the table, but I have no idea why they fluctuate like this. Can you please guide me through it? Thanks in advance.

CODE:

program prog
  integer, allocatable :: arr1(:), arr2(:)
  integer :: i, n, time_start, time_end
  n=481
  do while (n .le. 481000000)
    allocate(arr1(n),arr2(n))
    call system_clock(time_start)
    !dir$ offload begin target(mic)
    !$omp SIMD 
    do i=1,n
       arr1(i) = arr1(i) + arr2(i)
    end do
    !dir$ end offload 
    call system_clock(time_end)
    write (*,*) "n=",n," time=",time_end-time_start
    deallocate(arr1,arr2)
    n = n*10
  end do
end program

RESULTS:

 n=         481  time=        8881
 n=        4810  time=          75
 n=       48100  time=          53
 n=      481000  time=         261
 n=     4810000  time=        1991
 n=    48100000  time=       18912
 n=   481000000  time=      188203

Answer 1:

The first offload (n=481) will certainly be slow, because that is where all of the code is offloaded and the process on the KNC is initialised. If you don't want to see that cost, do an empty offload before you start timing things.
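A minimal sketch of that warm-up idea, reusing the same offload directive as the question. The program name and the `dummy` variable are made up for illustration, and the sketch assumes an Intel compiler with offload support; with any other compiler the `!dir$` lines are treated as comments and everything runs on the host.

```fortran
program warmup_demo
  implicit none
  integer :: time_start, time_end, count_rate
  integer :: dummy   ! hypothetical scalar, only there to give the offload something to do

  dummy = 0

  ! Untimed warm-up offload: this one absorbs the one-off cost of
  ! loading the code onto the coprocessor and initialising it.
  !dir$ offload begin target(mic)
  dummy = dummy + 1
  !dir$ end offload

  ! Timed offload: the start-up cost should no longer be included here.
  call system_clock(time_start, count_rate)
  !dir$ offload begin target(mic)
  dummy = dummy + 1
  !dir$ end offload
  call system_clock(time_end)

  write (*,*) "timed offload ticks=", time_end - time_start, &
              " ticks per second=", count_rate
end program warmup_demo
```

In the original program the same effect could be had by placing one such untimed offload before the `do while` loop, so that the n=481 measurement no longer carries the initialisation cost.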

At the high end (n >= 481000), things seem sane: each run takes roughly 10x as long as the one before, so the only oddity left is the scaling of the smaller runs. It's possible that some of that is related to load imbalance. If you have a 60-core processor and are running four threads per core (you didn't give us this information), that is 240 threads, so 4810 iterations works out to ~20 iterations per thread, which means the SIMD performance is likely to be poor, as you have 16 lanes. Given misalignment, you may only be executing a lead-in and a lead-out, and nothing at full width!
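The arithmetic behind that estimate can be sketched as follows. The 60 cores and four threads per core are the assumption made above, not confirmed hardware details, and the sketch further assumes the iterations are divided evenly across all hardware threads; the program and variable names are illustrative only.

```fortran
program work_per_thread
  implicit none
  ! Assumed configuration: 60 cores, 4 threads per core, and
  ! 16 lanes of 32-bit integers per vector register.
  integer, parameter :: cores = 60, threads_per_core = 4, lanes = 16
  integer :: n, threads, per_thread

  threads = cores * threads_per_core   ! 240 hardware threads

  n = 481
  do while (n <= 4810000)
    per_thread = n / threads           ! integer division, remainder ignored
    write (*,*) "n=", n, " iterations/thread=", per_thread, &
                " full vectors/thread=", per_thread / lanes
    n = n * 10
  end do
end program work_per_thread
```

For n = 4810 this gives 20 iterations per thread, which is at most one full 16-lane vector, and any misalignment can push even that into the scalar or masked lead-in and lead-out.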

Source: https://stackoverflow.com/questions/51207834/unexplained-xeon-phi-overhead

Tags

parallel-processing

acceleration

xeon-phi

offloading