Let's say I have several threads that access memory at addresses A+0, A+4, A+8, A+12 (each consecutive thread takes the next address). Such access is coalesced, right?
However, if the threads access those same addresses in reverse or otherwise permuted order, is the access still coalesced?
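For concreteness, here is a rough sketch of the pattern I mean (the kernel and variable names are just illustrative, assuming A holds 4-byte ints):

```
// Each thread reads the 4-byte word at A + 4*tid, i.e. consecutive threads
// touch consecutive addresses. A warp's 32 loads cover one contiguous,
// 128-byte-aligned segment (assuming A itself is suitably aligned), so the
// warp's request should coalesce into a single memory transaction.
__global__ void sequential_read(const int *A, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = A[tid];   // thread 0 -> A+0, thread 1 -> A+4, thread 2 -> A+8, ...
}
```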
It's also worth noting that one of the main purposes of the L2 cache in an NVIDIA GPU is to collapse reads and coalesce writes. So if one warp were accessing
thread 0 -> A+0
thread 1 -> A+8
thread 2 -> A+16
thread 3 -> A+24
...
and another warp was accessing
thread 0 -> A+4
thread 1 -> A+12
thread 2 -> A+20
thread 3 -> A+28
...
these two accesses would not coalesce inside the SM, but they would generally coalesce in the L2 cache, so GPU memory is only touched once.
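To make that two-warp pattern concrete, here is a hypothetical kernel sketch of mine (assuming A holds 4-byte ints, so A+8 is element 2, A+16 is element 4, and so on, and assuming a block size that is a multiple of 64):

```
// Even-numbered warps read words 0, 2, 4, ... of their range (byte offsets
// 0, 8, 16, ...); odd-numbered warps read words 1, 3, 5, ... (byte offsets
// 4, 12, 20, ...). Each warp's request has an 8-byte stride, so it does not
// fully coalesce at the SM, but the two warps of a pair touch the same
// 128-byte lines, which the L2 can serve from a single DRAM access.
__global__ void interleaved_by_warp(const int *A, int *out, int n)
{
    int warp = threadIdx.x / 32;   // warp index within the block
    int lane = threadIdx.x % 32;   // lane index within the warp
    // each pair of warps covers 64 consecutive words
    int pair = blockIdx.x * (blockDim.x / 64) + warp / 2;
    // even warp -> even words, odd warp -> odd words
    int word = pair * 64 + 2 * lane + (warp % 2);
    if (word < n)
        out[word] = A[word];
}
```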
Yes, for cc 2.0 and newer GPUs, coalescing will occur for any arbitrary assignment of 32-bit data elements to threads, as long as all the requested 32-bit elements come from the same 128-byte (and 128-byte-aligned) region in global memory.
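As an illustration of that case, here is a hypothetical kernel sketch of mine (names are not from any particular source), in which each warp reads its 128-byte segment in reverse order:

```
// Each warp reads one 128-byte-aligned segment (32 consecutive ints), but
// lane 0 takes the last word of the segment, lane 1 the next-to-last, and so
// on. On cc 2.0+ this reversed (or any other) permutation within the segment
// still coalesces into a single 128-byte transaction.
__global__ void reversed_within_segment(const int *A, int *out, int n)
{
    int gtid        = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_global = gtid / 32;                       // global warp index
    int lane        = threadIdx.x % 32;                // lane within the warp
    int word        = warp_global * 32 + (31 - lane);  // reversed within the segment
    if (word < n)
        out[gtid] = A[word];
}
```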
The GPU has something like a "crossbar switch" in the memory controller that distributes elements to threads as needed. You may be interested in this GPU webinar, which discusses coalescing and illustrates this particular case pictorially (on slide 12).
The NVIDIA webinar page has other useful webinars you may be interested in as well.
For pre-cc 2.0 devices, the specifics vary by compute capability, but compute capability 1.0 and 1.1 devices do not have this ability to coalesce reads that are in "reverse order" or random order.