CUDA coalesced access to global memory

Question


I have read the CUDA programming guide, but I missed one thing. Let's say I have an array of 32-bit ints in global memory and I want to copy it to shared memory with coalesced access. The global array has indices 0 to 1023, and let's say I have 4 blocks, each with 256 threads.

__shared__ int sData[256];

When is coalesced access performed?

1.

sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y];

Addresses 0 to 255 in global memory are copied, 32 at a time by each warp, so it's OK here?

2.

sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y + someIndex];

If someIndex is not a multiple of 32, is it not coalesced? Misaligned addresses? Is that correct?


Answer 1:


What you want ultimately depends on whether your input data is a 1D or 2D array, and whether your grid and blocks are 1D or 2D. The simplest case is both 1D:

shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + threadIdx.x];

This is coalesced. The rule of thumb I use is that the most rapidly varying coordinate (threadIdx) is added as an offset to the block offset (blockDim * blockIdx). The end result is that the indexing stride between threads in the block is 1. If the stride gets larger, you lose coalescing.
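As a self-contained sketch (the kernel name and launch shape are my assumptions, not from the answer), the 1D case looks like this:

__global__ void loadTile1D(const int *gmem)
{
    __shared__ int sData[256];
    // Consecutive threads read consecutive ints: stride 1, coalesced.
    sData[threadIdx.x] = gmem[blockDim.x * blockIdx.x + threadIdx.x];
    __syncthreads();
    // ... work on sData ...
}

Launched as loadTile1D<<<4, 256>>>(d_data) for the question's 1024-element array, each warp reads a contiguous, 128-byte-aligned run of 32 ints.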

The simple rule (on Fermi and later GPUs) is that if the addresses for all threads in a warp fall into the same aligned 128-byte range, then a single memory transaction will result (assuming caching is enabled for the load, which is the default). If they fall into two aligned 128-byte ranges, then two memory transactions result, etc.
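As a concrete instance of that rule (assuming 4-byte ints and a warp of 32 threads): if a warp reads gmem[0..31], it touches bytes 0-127, one aligned 128-byte range and hence one transaction; if it reads gmem[1..32], it touches bytes 4-131, which straddles the boundary at byte 128, so two transactions are issued.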

On GT2xx and earlier GPUs, it gets more complicated. But you can find the details of that in the programming guide.

Additional examples:

Not coalesced:

shmem[threadIdx.x] = gmem[blockDim.x + blockIdx.x * threadIdx.x];

Not coalesced, but not too bad on GT200 and later:

stride = 2;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];
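With 4-byte elements, stride = 2 spreads a warp over 32 x 2 x 4 = 256 bytes, i.e. two aligned 128-byte segments, so the load costs two transactions and half of each fetched segment is wasted.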

Not coalesced at all:

stride = 32;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];
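Here neighboring threads are 32 x 4 = 128 bytes apart, so every thread in the warp lands in its own aligned 128-byte segment and a single warp load costs 32 separate transactions.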

Coalesced, 2D grid, 1D block:

int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.x] = gmem[blockIdx.y * elementPitch + 
                          blockIdx.x * blockDim.x + threadIdx.x]; 

Coalesced, 2D grid and block:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.y * blockDim.x + threadIdx.x] = gmem[y * elementPitch + x];
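Putting the 2D case together as a runnable sketch (the 32x8 tile size, kernel name, and launch shape are my assumptions; the grid is assumed to exactly cover the array, so elementPitch equals the row width):

__global__ void loadTile2D(const int *gmem, int *out)
{
    __shared__ int sData[32 * 8];                // one tile per 32x8 block
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int elementPitch = blockDim.x * gridDim.x;   // row width in elements
    // With blockDim.x == 32, each warp is one row of the block: all 32
    // threads share threadIdx.y and read 32 consecutive ints, so the
    // load is coalesced.
    sData[threadIdx.y * blockDim.x + threadIdx.x] = gmem[y * elementPitch + x];
    __syncthreads();
    out[y * elementPitch + x] = sData[threadIdx.y * blockDim.x + threadIdx.x];
}

// Hypothetical launch, assuming width % 32 == 0 and height % 8 == 0:
// dim3 block(32, 8), grid(width / 32, height / 8);
// loadTile2D<<<grid, block>>>(d_in, d_out);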



Answer 2:


Your indexing in case 1 is wrong (or so intentionally strange that it seems wrong): some blocks access the same element from every thread, so there is no way those blocks' accesses can be coalesced.

Proof:

Example:

Grid = dim3(2, 2, 1)

t(blockIdx.x, blockIdx.y) denotes the block at that grid position

//complete block reads at 0
t(0,0) -> sData[threadIdx.x] = gData[0];
//complete block reads at 2
t(0,1) -> sData[threadIdx.x] = gData[2];
//definitely coalesced
t(1,0) -> sData[threadIdx.x] = gData[threadIdx.x];
//not coalesced, since the offset 2 is not a multiple of the half-warp size (16)
t(1,1) -> sData[threadIdx.x] = gData[threadIdx.x + 2];

So whether a given block happens to be coalesced is a game of luck; in general, no, this indexing is not coalesced.
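A quick host-side check (a hypothetical standalone snippet, not part of the answer) that expands the question's index expression for each block of the 2x2 grid reproduces the four cases above:

#include <stdio.h>

int main(void)
{
    const int gridDimX = 2;
    for (int by = 0; by < 2; ++by)
        for (int bx = 0; bx < 2; ++bx)
            // index = threadIdx.x * blockIdx.x + gridDim.x * blockIdx.y
            printf("block (%d,%d): gData[threadIdx.x * %d + %d]\n",
                   bx, by, bx, gridDimX * by);
    return 0;
}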

That said, the coalescing rules for memory reads are not as strict on newer CUDA hardware as they used to be. For compatibility, though, you should optimize your kernels for the lowest compute capability you target, if possible.

Here is a good source:

http://mc.stanford.edu/cgi-bin/images/0/0a/M02_4.pdf




Answer 3:


The rules for which accesses can be coalesced are somewhat complicated and they have changed over time. Each new CUDA architecture is more flexible in what it can coalesce. I would say not to worry about it at first. Instead, do the memory accesses in whatever way is the most convenient and then see what the CUDA profiler says.
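For example, with the legacy nvprof profiler (these metric names are from older CUDA toolkits; Nsight Compute names them differently), low global-load efficiency or a high transaction count flags uncoalesced access:

nvprof --metrics gld_efficiency,gld_transactions ./myApp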




Answer 4:


Your examples are correct if you intended to use a 1D grid and 1D thread geometry. I think the indexing you intended is [blockIdx.x*blockDim.x + threadIdx.x].

With #1, the 32 threads in a warp execute that instruction 'simultaneously', so their requests, which are sequential and aligned to 128 bytes (32 x 4), are coalesced on both the Tesla and Fermi architectures, I believe.

With #2, it is a bit blurry. If someIndex is 1, it won't coalesce all 32 requests in a warp, but it might do partial coalescing. I believe Fermi devices will coalesce the accesses for threads 1-31 of the warp as part of a 128-byte sequential segment of memory (the first 4 bytes, which no thread needs, are wasted). I think Tesla-architecture devices would make that an uncoalesced access due to the misalignment, but I am not sure.

With someIndex as, say, 8, Tesla will have 32-byte-aligned addresses, and Fermi might group them as 32B, 64B, and 32B segments. The bottom line is that, depending on the value of someIndex and the architecture, what happens is blurry, and it won't necessarily be terrible.
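As a worked example for someIndex = 8, assuming 4-byte ints: the warp touches bytes 8 x 4 = 32 through (8 + 32) x 4 - 1 = 159, straddling the aligned boundary at byte 128, so a cached load on Fermi would be serviced as two 128-byte transactions rather than one.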



Source: https://stackoverflow.com/questions/10325244/cuda-coalesced-access-to-global-memory
