On Page 21 of the CUDA 4.0 programming guide there is an example (given below) to illustrate looping over the elements of a 2D array of floats in device memory. The dimensi
The cast to char* is only there to make the pointer arithmetic work in bytes, because pitch is measured in bytes, not in elements;
(float*)((char*)devPtr + r * pitch);
moves r*pitch bytes forward while
(float*)(devPtr + r * pitch);
would move r*pitch floats forward (i.e. 4 times as many bytes, since sizeof(float) is 4), because arithmetic on a float* is scaled by the size of a float.