Question
My code is being slowed down by accesses to a 4D array in global memory.
I am using the PGI 2010 compiler.
The 4D array is read-only on the device, and its size is known only at run time.
I wanted to place it in texture memory, but found that my PGI version does not support textures. Because the size is known only at run time, constant memory is not an option either.
Only one dimension is known at compile time, like this: MyFourD(100, x,y,z)
where x, y, z are user input.
My first idea was to use pointers, but I am not familiar with Fortran pointers.
If you have experience with this kind of situation, I would appreciate your help. This alone makes my code 5 times slower than expected.
The following is a sample of what I am trying to do:
      integer :: i, j, k
      ! zero-based global thread indices in x and y
      i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
      j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1
      do k = 0, 100
        regvalue1 = somevalue1
        regvalue2 = somevalue2
        regvalue3 = somevalue3
        d_value(i,j,k)=d_value(i,j,k)
     &    +myFourdArray(10,i,j,k)*regvalue1
     &    +myFourdArray(32,i,j,k)*regvalue2
     &    +myFourdArray(45,i,j,k)*regvalue3
      end do
Best regards,
Answer 1:
I believe the answer from @Alexander Vogt is on the right track - I would think about re-ordering the array storage. But I would try it like this:
      integer :: i, j, k
      ! zero-based global thread indices in x and y
      i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
      j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1
      do k = 0, 100
        regvalue1 = somevalue1
        regvalue2 = somevalue2
        regvalue3 = somevalue3
        d_value(i,j,k)=d_value(i,j,k)
     &    +myFourdArray(i,j,k,10)*regvalue1
     &    +myFourdArray(i,j,k,32)*regvalue2
     &    +myFourdArray(i,j,k,45)*regvalue3
      end do
Note that the only change is to myFourdArray; there is no need to change the data ordering in the d_value array.
The crux of this change is that we are allowing adjacent threads to access adjacent elements in myFourdArray, and so we are allowing for coalesced access. Your original formulation forced adjacent threads to access elements that were separated by the length of the first dimension, and so did not allow for useful coalescing.
Whether in CUDA C or CUDA Fortran, threads are grouped in the X dimension first, then Y, and then Z. So the most rapidly varying thread index is X. Therefore, in array access, we want this rapidly varying thread index to appear in the array subscript that is also the most rapidly varying in memory.
In Fortran this index is the first of a multiple-subscripted array.
In C, this index is the last of a multiple-subscripted array.
Your original code followed this convention for d_value by placing the X thread index (i) in the first array subscript position. But it broke this convention for myFourdArray by putting a constant in the first array subscript position. Thus your accesses to myFourdArray are noticeably slower.
When there is a loop in the code, we also don't want to place the loop variable (i.e. k, in this case, as Alexander Vogt did) first (for Fortran, or last for C), because doing that will also break coalescing. For each iteration of the loop, we have multiple threads executing in lockstep, and those threads should all access adjacent elements. This is facilitated by having the X thread indexed subscript (e.g. i) first (for Fortran, or last for C).
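To make the stride argument concrete, here is a minimal host-side sketch in plain Fortran (no GPU needed) that computes the column-major element offset for both layouts and prints the distance in memory between the elements read by two threads with adjacent i. The extents nx, ny, nz and the helper names offset_orig and offset_new are hypothetical, chosen only for illustration. The sketch uses default 1-based subscripts, whereas the kernel above uses zero-based indices; the stride is the same either way.

```fortran
program stride_demo
  implicit none
  ! Hypothetical extents for illustration only; in the question just the
  ! 100 is fixed and x, y, z come from user input.
  integer, parameter :: n = 100, nx = 64, ny = 64, nz = 101
  integer :: i, j, k, m

  i = 5; j = 3; k = 7; m = 10

  ! Layout from the question: myFourdArray(m, i, j, k)
  print *, 'original layout, stride between i and i+1: ', &
       offset_orig(m, i + 1, j, k) - offset_orig(m, i, j, k)

  ! Layout from this answer: myFourdArray(i, j, k, m)
  print *, 'reordered layout, stride between i and i+1:', &
       offset_new(i + 1, j, k, m) - offset_new(i, j, k, m)

contains

  ! Column-major element offset for an array declared (n, nx, ny, nz)
  integer function offset_orig(m, i, j, k)
    integer, intent(in) :: m, i, j, k
    offset_orig = (m - 1) + n * ((i - 1) + nx * ((j - 1) + ny * (k - 1)))
  end function offset_orig

  ! Column-major element offset for an array declared (nx, ny, nz, n)
  integer function offset_new(i, j, k, m)
    integer, intent(in) :: i, j, k, m
    offset_new = (i - 1) + nx * ((j - 1) + ny * ((k - 1) + nz * (m - 1)))
  end function offset_new

end program stride_demo
```

The first stride comes out as 100 elements and the second as 1, which is the difference between a scattered and a coalesced read across adjacent threads.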
Answer 2:
You could invert the indexing, i.e. let the first dimension change the fastest. Fortran is column-major!
do k = 0, 100
  regvalue1 = somevalue1
  regvalue2 = somevalue2
  regvalue3 = somevalue3
  d_value(k,i,j)=d_value(k,i,j) + &
    myFourdArray(k,i,j,10)*regvalue1 + &
    myFourdArray(k,i,j,32)*regvalue2 + &
    myFourdArray(k,i,j,45)*regvalue3
end do
If the last (in the original case, first) dimension is always fixed (and not too large), consider using individual arrays instead.
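One way to read the "individual arrays" suggestion is sketched below in plain host-side Fortran: each plane the loop actually uses is copied once into its own contiguous 3D array. The names coef10, coef32 and coef45 and the small extents are hypothetical. How the copies would then be transferred to the device and indexed in the kernel (for example as coef10(i,j,k)) is an assumption and is not shown.

```fortran
program split_planes
  implicit none
  ! Hypothetical extents for illustration only; in the question just the
  ! 100 is fixed and x, y, z come from user input.
  integer, parameter :: n = 100, nx = 4, ny = 3, nz = 2
  real, allocatable :: myFourdArray(:,:,:,:)
  ! Hypothetical names for the three planes used by the loop.
  real, allocatable :: coef10(:,:,:), coef32(:,:,:), coef45(:,:,:)

  allocate(myFourdArray(n, nx, ny, nz))
  call random_number(myFourdArray)

  allocate(coef10(nx, ny, nz), coef32(nx, ny, nz), coef45(nx, ny, nz))

  ! Gather each strided plane once on the host into a contiguous 3D array.
  coef10 = myFourdArray(10, :, :, :)
  coef32 = myFourdArray(32, :, :, :)
  coef45 = myFourdArray(45, :, :, :)

  ! Sanity check: the copies hold exactly the values of the original planes.
  print *, 'copies match: ', &
       all(coef10 == myFourdArray(10, :, :, :)) .and. &
       all(coef32 == myFourdArray(32, :, :, :)) .and. &
       all(coef45 == myFourdArray(45, :, :, :))
end program split_planes
```

With separate arrays, the first subscript of each one is free to be the X thread index, and only the planes that are actually read need to be stored.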
In my experience, pointers do not change much in terms of speed-up when applied to large arrays. What you could try is strip-mining to optimize your loops in terms of cache access, but I do not know the compile option to enable this with the PGI compiler.
Ah, ok it is a simple directive:
!$acc do vector
do k=...
enddo
Source: https://stackoverflow.com/questions/18958634/cuda-fortran-4d-array