printf inside CUDA __global__ function
Question

I am currently writing a matrix multiplication kernel for a GPU and would like to debug my code. Since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current function:

    __global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd){
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        int bx = blockIdx.x;
        int by = blockIdx.y;
        float sum = 0;
        for( int k = 0; k < Ad.width ; ++k){
            float Melement = Ad.elements[ty * Ad.width