Consider the following code. Is shared_array guaranteed to be initialised (for example, to zero)?
__global__ void kernel(int *something) {
    extern __shared__ int shared_array[];

    // Some operations on shared_array here.
}
No. Shared memory is uninitialised; its contents are whatever happened to be in that memory beforehand. You have to initialise it yourself, one way or another...
From CUDA C Programming Guide 3.2, Section B.2.4.2, paragraph 2:
__shared__
variables cannot have an initialization as part of their declaration.
This also rules out nontrivial default constructors for shared variables.
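To illustrate the restriction, here is a minimal sketch (identifiers are illustrative, not from the original post):

```cuda
__global__ void kernel(int *out) {
    // Illegal: __shared__ variables may not carry an initializer.
    // __shared__ int illegal = 0;             // compile error
    // __shared__ int illegal_array[32] = {0}; // compile error

    // Legal: declare first, then assign at run time.
    __shared__ int legal;
    if (threadIdx.x == 0)
        legal = 0;
    __syncthreads(); // make the store visible to all threads in the block

    out[threadIdx.x] = legal;
}
```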
You can efficiently initialize shared arrays in parallel like this:
// If SHARED_SIZE == blockDim.x, the loop runs exactly once per thread
// and can be replaced by a single assignment.
for (int i = threadIdx.x; i < SHARED_SIZE; i += blockDim.x)
    shared_array[i] = INITIAL_VALUE;
__syncthreads();
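For context, a complete kernel and launch using this pattern might look like the following sketch (names such as shared_size and num_blocks are illustrative); the third launch parameter supplies the dynamic shared-memory size in bytes:

```cuda
#define INITIAL_VALUE 0

__global__ void kernel(int *out, int shared_size) {
    extern __shared__ int shared_array[];

    // Each thread fills every blockDim.x-th element, so the whole
    // array is covered cooperatively regardless of block size.
    for (int i = threadIdx.x; i < shared_size; i += blockDim.x)
        shared_array[i] = INITIAL_VALUE;
    __syncthreads();

    // ... operate on shared_array here ...
}

// Host side: the size in bytes goes in the third <<<>>> parameter.
// kernel<<<num_blocks, threads_per_block,
//          shared_size * sizeof(int)>>>(d_out, shared_size);
```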
Yes, you can. You can specify that the first thread in the block sets it while the others don't, e.g.:
extern __shared__ unsigned int local_bin[]; // Size specified in the kernel call

if (threadIdx.x == 0) // Wipe on the first thread only. Add "&& threadIdx.y == 0"
                      // (and "&& threadIdx.z == 0") if the thread block has
                      // 2 or 3 dimensions instead of 1.
{
    // For-loop setting every local_bin element to the desired value goes here.
    // Note that you cannot use cudaMemset here: it is a host-side API call
    // and cannot be invoked from device code.
}

// Do stuff unrelated to local_bin here.

__syncthreads(); // Ensure the wipe above has completed before other threads
                 // start writing values to local_bin.

// Do stuff to local_bin here.
Ideally, do as much unrelated work as possible before the __syncthreads() call: the other threads can then make progress while thread 0 performs the wipe. This only matters when threads may finish that work at quite different times, for example when there is conditional branching. Note that for thread 0's wiping for-loop, you need to pass the size of the local_bin array as a kernel parameter, since the kernel has no other way to know how many elements to iterate over.
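Putting those points together, a sketch of the single-thread approach applied to a per-block histogram (the kernel name, data layout, and use of atomicAdd are illustrative assumptions, not from the original answer) could be:

```cuda
__global__ void histogram_kernel(const unsigned int *data, int n, int num_bins) {
    extern __shared__ unsigned int local_bin[]; // num_bins elements, size set at launch

    if (threadIdx.x == 0) {
        // Only thread 0 wipes the bins; num_bins must be passed in,
        // since the kernel cannot otherwise know the array's length.
        for (int i = 0; i < num_bins; ++i)
            local_bin[i] = 0;
    }

    // Work that does not touch local_bin can go here, before the barrier.

    __syncthreads(); // all threads wait until the wipe is complete

    // Now it is safe for every thread to accumulate into local_bin,
    // using atomics to avoid races between threads hitting the same bin.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        atomicAdd(&local_bin[data[i] % num_bins], 1u);
}
```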
Original concept source