From non-coalesced access to coalesced memory access in CUDA

Question

I was wondering if there is any simple way to transform a non-coalesced memory access into a coalesced one. Let's take the example of this array:

dW[[w0,w1,w2][w3,w4,w5][w6,w7][w8,w9]]

Now, I know that if Thread 0 in block 0 accesses dW[0] and then Thread 1 in block 0 accesses dW[1], that's a coalesced access to global memory. The problem is that I have two operations. The first one is coalesced as described above. But the second one isn't, because Thread 1 in block 0 needs to do an operation on all of dW[0], dW[1] and dW[2].

I know that the initial layout of the container allows or forbids coalesced access. But dW is a very big array, and I can't transform it during the process.

Do you know if it's possible to alleviate this problem?

Answer 1:

You could try using shared memory; that might work (or not, it's hard to tell without an example).

For instance, say the first operation accesses coalesced data and the second one strides a lot; this may speed things up:

__shared__ int shared[BLOCK_SIZE];
// Load data global -> shared with coalesced access; you may need to load a bit more before/after depending on your application
shared[tid] = global[/* some id */];
// Wait until every thread in the block has finished its load
__syncthreads();
// Do the math with the coalesced access
function0(shared[tid]);
// Do the math with the non-coalesced access
function1(shared[tid + 1]);  // or tid - 1, or whatever offset you need

The idea is to load the data into shared memory in a coalesced manner, and then do the math from shared memory, since coalescing does not matter for shared memory accesses (bank conflicts do, on the other hand, but that's usually fine).

You'll have to give us more information if you want more accurate help. That's just a hint.
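To make the hint more concrete, here is a minimal, self-contained sketch of the same pattern. It rests on several assumptions that are not in the question: the second operation is replaced by a simple three-point sum (each thread reads its own element and both neighbors), the block size is 256, and the names `sum3` and `count_mismatches` are made up for illustration. Your real operation and indexing will probably differ.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK_SIZE = 256;  // assumed block size; the launch must use it

// Hypothetical stand-in for the "second operation":
// out[i] = in[i-1] + in[i] + in[i+1], with out-of-range neighbors counted as 0.
__global__ void sum3(const int* in, int* out, int n)
{
    // One tile per block, plus one halo cell on each side.
    __shared__ int tile[BLOCK_SIZE + 2];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Coalesced load: consecutive threads read consecutive global addresses.
    tile[tid + 1] = (gid < n) ? in[gid] : 0;

    // The first and last thread of the block also fetch the halo cells.
    if (tid == 0)
        tile[0] = (gid > 0 && gid - 1 < n) ? in[gid - 1] : 0;
    if (tid == blockDim.x - 1)
        tile[tid + 2] = (gid + 1 < n) ? in[gid + 1] : 0;

    // Every thread reaches the barrier (no early return above).
    __syncthreads();

    // Neighbor reads now hit shared memory instead of global memory.
    if (gid < n)
        out[gid] = tile[tid] + tile[tid + 1] + tile[tid + 2];
}

// Runs the kernel on n elements and compares against a CPU reference.
// Error checking of the CUDA calls is omitted to keep the sketch short.
int count_mismatches(int n)
{
    std::vector<int> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = i % 7;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    sum3<<<blocks, BLOCK_SIZE>>>(d_in, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        int expected = h_in[i]
                     + (i > 0 ? h_in[i - 1] : 0)
                     + (i + 1 < n ? h_in[i + 1] : 0);
        if (h_out[i] != expected) ++bad;
    }
    return bad;
}

int main()
{
    int bad = count_mismatches(1000);
    std::printf("mismatches: %d\n", bad);
    return bad != 0;
}
```

Each element of the input is read from global memory once per block (plus the two halo cells), and all of the neighbor accesses are served from shared memory. If your second operation reaches further than one element away, the halo would need to be widened accordingly.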

Source: https://stackoverflow.com/questions/39876785/from-non-coalesced-access-to-coalesced-memory-access-cuda

Tags

c++

cuda

gpgpu

memory-access