My CUDA application has constant memory of less than 8KB. Since it will all be cached, do I need to worry about every thread accessing the same address for optimization?
Since it will all be cached, do I need to worry about every thread accessing the same address for optimization?
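Yes. The constant cache itself can only serve up one 32-bit word per cycle, so if the threads of a warp request different addresses, those requests are serialized, even when all of the data is resident in the cache.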
If yes, how do I assure all threads are accessing the same address at the same time?
Ensure that whatever kind of indexing or addressing you use to reference an element in the constant memory area does not depend on any of the built-in thread variables, e.g. threadIdx.x, threadIdx.y, or threadIdx.z. Note that the actual requirement is less stringent than this. You can achieve the necessary goal as long as the indexing evaluates to the same number for every thread in a given warp. Here are a few examples:
__constant__ int data[1024];
...
// assume 1D threadblock
int idx = threadIdx.x;
int bidx = blockIdx.x;
int a = data[idx]; // bad - every thread accesses a different element
int b = data[12]; // ok - every thread accesses the same element
int c = data[b]; // ok - b is a constant w.r.t threads
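int d = data[b + idx]; // bad - idx differs per thread, so the addresses differ within a warp
int e = data[b + bidx]; // ok - blockIdx.x is the same for every thread in the block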
int f = data[idx/32]; // ok - the same element is being accessed per warp
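As a minimal sketch of how the patterns above might look in a complete program: the names `coeffs`, `warpUniformRead`, `perThreadRead` and `runDemo` are illustrative and not from the original question, and a 1D threadblock with a warp size of 32 is assumed. Both kernels return correct values; the only difference is that the second one has every lane of a warp reading a different constant-memory address, which is the serialized case.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical lookup table; the name and contents are illustrative only.
__constant__ int coeffs[1024];

// Warp-uniform index: every lane of a warp reads the same element,
// so a single constant-cache access can be broadcast to the whole warp.
__global__ void warpUniformRead(int *out)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;   // same value for all lanes of a warp (1D block)
    out[tid] = coeffs[blockIdx.x + warp];
}

// Per-thread index: lanes of a warp read different elements, so the
// accesses are serialized. The result is still correct, just slower.
__global__ void perThreadRead(int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = coeffs[threadIdx.x];
}

// Returns 0 if both kernels produced the expected values, 1 otherwise.
int runDemo()
{
    const int threads = 128, blocks = 4, n = threads * blocks;
    static int table[1024];
    static int result[n];
    for (int i = 0; i < 1024; ++i) table[i] = 2 * i;

    // Constant memory is filled from the host with cudaMemcpyToSymbol.
    if (cudaMemcpyToSymbol(coeffs, table, sizeof(table)) != cudaSuccess) return 1;

    int *dOut = nullptr;
    if (cudaMalloc(&dOut, n * sizeof(int)) != cudaSuccess) return 1;

    int status = 0;

    warpUniformRead<<<blocks, threads>>>(dOut);
    if (cudaMemcpy(result, dOut, n * sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) status = 1;
    for (int i = 0; i < n && status == 0; ++i) {
        int b = i / threads, t = i % threads;
        if (result[i] != 2 * (b + t / 32)) status = 1;   // assumes warp size 32
    }

    perThreadRead<<<blocks, threads>>>(dOut);
    if (cudaMemcpy(result, dOut, n * sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) status = 1;
    for (int i = 0; i < n && status == 0; ++i) {
        if (result[i] != 2 * (i % threads)) status = 1;
    }

    cudaFree(dOut);
    return status;
}

int main()
{
    int status = runDemo();
    printf("%s\n", status == 0 ? "results match" : "mismatch or CUDA error");
    return status;
}
```

If you want to see the cost of the second pattern rather than just its correctness, one option is to time the two kernels separately in a profiler.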