Question
I recently discovered the racecheck tool of cuda-memcheck, available in CUDA 5.0 (cuda-memcheck --tool racecheck, see the NVIDIA doc). This tool can detect race conditions with shared memory in a CUDA kernel.
In debug mode, this tool does not detect anything, which is apparently normal. However, in release mode (-O3), I get errors depending on the parameters of the problem.
Here is an error example (initialization of shared memory on line 22, assignment on line 119):
========= ERROR: Potential WAW hazard detected at shared 0x0 in block (35, 0, 0) :
=========     Write Thread (32, 0, 0) at 0x00000890 in ....h:119:void kernel_test3(Data*)
=========     Write Thread (0, 0, 0) at 0x00000048 in ....h:22:void kernel_test3(Data*)
=========     Current Value : 13, Incoming Value : 0
- The first thing that surprised me is the thread ids. When I first encountered the error, each block contained 32 threads (ids 0 to 31). So why is there a problem with thread id 32? I even added an extra check on threadIdx.x, but this changed nothing (a guard roughly like the sketch after this list).
- I use shared memory as a temporary buffer, and each thread deals with its own part of a multidimensional array, e.g. __shared__ float arr[SIZE_1][SIZE_2][NB_THREADS_PER_BLOCK]. I do not really understand how there could be any race condition, since each thread only touches its own part of shared memory.
- Reducing the grid size from 64 blocks to 32 blocks seemed to solve the issue (with 32 threads per block). I do not understand why.
In order to understand what was happening, I tested with some simpler kernels.
Let me show you an example of a kernel that creates that kind of error. Basically, this kernel uses SIZE_X*SIZE_Y*NTHREADS*sizeof(float) bytes of shared memory, and I can use 48 KB of shared memory per SM.
test.cu
template <unsigned int NTHREADS>
__global__ void kernel_test()
{
    const int SIZE_X = 4;
    const int SIZE_Y = 4;

    __shared__ float tmp[SIZE_X][SIZE_Y][NTHREADS];

    for (unsigned int i = 0; i < SIZE_X; i++)
        for (unsigned int j = 0; j < SIZE_Y; j++)
            tmp[i][j][threadIdx.x] = threadIdx.x;
}

int main()
{
    const unsigned int NTHREADS = 32;

    //kernel_test<NTHREADS><<<32, NTHREADS>>>(); // ---> works fine
    kernel_test<NTHREADS><<<64, NTHREADS>>>();

    cudaDeviceSynchronize(); // ---> gives racecheck errors if NBLOCKS > 32
}
Compilation:
nvcc test.cu --ptxas-options=-v -o test
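For the release-mode (-O3) runs mentioned above, the compile line would look something like the following; adding -lineinfo is what lets racecheck map the reported addresses back to source file and line numbers:
nvcc -O3 -lineinfo test.cu --ptxas-options=-v -o test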
If we run the kernel:
cuda-memcheck --tool racecheck test
- kernel_test<32><<<32, 32>>>(); : 32 blocks, 32 threads => does not lead to any apparent racecheck error.
- kernel_test<32><<<64, 32>>>(); : 64 blocks, 32 threads => leads to WAW hazards (threadIdx.x = 32?!) and errors.
========= ERROR: Potential WAW hazard detected at shared 0x6 in block (57, 0, 0) :
=========     Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
=========     Write Thread (1, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
=========     Current Value : 0, Incoming Value : 128
========= INFO:(Identical data being written) Potential WAW hazard detected at shared 0x0 in block (47, 0, 0) :
=========     Write Thread (32, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
=========     Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
=========     Current Value : 0, Incoming Value : 0
So what am I missing here? Am I doing something wrong with shared memory? (I am still a beginner with this)
** UPDATE **
The problem seems to be coming from cudaDeviceSynchronize() when NBLOCKS > 32. Why is this happening?
Answer 1:
For starters, cudaDeviceSynchronize() isn't the cause; your kernel is. But kernel launches are asynchronous, so the error is only caught at your call to cudaDeviceSynchronize().
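To see this in practice, something along these lines separates a launch-time error from an error that only surfaces at synchronization (this snippet is illustrative, not from the original post):
#include <cstdio>

__global__ void dummy_kernel() { }   // stands in for kernel_test from the question

int main()
{
    dummy_kernel<<<64, 32>>>();

    // A bad launch configuration would show up here, right after the launch...
    cudaError_t launchErr = cudaGetLastError();
    printf("launch: %s\n", cudaGetErrorString(launchErr));

    // ...whereas anything that goes wrong while the kernel executes is only
    // reported once the host synchronizes with the device.
    cudaError_t syncErr = cudaDeviceSynchronize();
    printf("sync  : %s\n", cudaGetErrorString(syncErr));
    return 0;
}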
As for the kernel, your shared memory is of size SIZE_X*SIZE_Y*NTHREADS (which in the example translates to 512 elements per block). In your nested loops you index into it using [i*blockDim.x*SIZE_Y + j*blockDim.x + threadIdx.x] -- this is where your problem is.
To be more specific, your i and j values will range from [0, 4), your threadIdx.x from [0, 32), and your SIZE_{X | Y} values are 4. When blockDim.x is 64, your maximum index used in the loop will be 991 (from 3*64*4 + 3*64 + 31). When your blockDim.x is 32, your maximum index will be 511.
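Spelling that arithmetic out (a small host-side check that just re-runs the numbers from the previous paragraph; threadIdx.x tops out at 31 with 32 threads per block):
#include <cstdio>

int main()
{
    const int SIZE_Y = 4;
    const int dims[2] = {32, 64};

    for (int k = 0; k < 2; k++)
    {
        int stride = dims[k];
        // Largest flattened index reached by the loop for this stride.
        int maxIndex = 3 * stride * SIZE_Y + 3 * stride + 31;
        printf("blockDim.x = %d -> max index = %d\n", stride, maxIndex);  // 511 and 991
    }
    return 0;
}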
Based on your code, you should get errors whenever your NBLOCKS exceeds your NTHREADS.
NOTE: I originally posted this to https://devtalk.nvidia.com/default/topic/527292/cuda-programming-and-performance/cuda-racecheck-shared-memory-array-and-cudadevicesynchronize-/
Answer 2:
This was apparently a bug in the NVIDIA drivers for Linux. The bug disappeared after the 313.18 release.
Source: https://stackoverflow.com/questions/13861017/cuda-racecheck-shared-memory-array-and-cudadevicesynchronize