On Page 21 of the CUDA 4.0 programming guide there is an example (given below) to illustrate looping over the elements of a 2D array of floats in device memory. The dimensi
*(devPtr + 1) will offset the pointer by 4 bytes (sizeof(float)) before the * dereferences it.
*((char*)devPtr + 1) will offset the pointer by 1 byte (sizeof(char)) before the * dereferences it.
This is due to the way pointer arithmetic works in C. When you add an integer x to a pointer p, the address does not necessarily advance by x bytes. It advances by x times sizeof([type that p points to]) bytes.
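The rule above can be checked on the host with a short C sketch. The helper names (float_step_bytes, char_step_bytes, check_pointer_steps) are made up for this illustration and do not come from the CUDA guide; the caller is assumed to pass a buffer holding at least x floats so the arithmetic stays inside the array.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes between p and p + x when the arithmetic is done on a float*.
   Assumes p points into an array with at least x elements. */
size_t float_step_bytes(const float *p, size_t x)
{
    return (size_t)((const char *)(p + x) - (const char *)p);
}

/* Bytes between p and p + x when p is first viewed as a char*.
   Assumes the underlying object is at least x bytes long. */
size_t char_step_bytes(const float *p, size_t x)
{
    const char *bytes = (const char *)p;
    return (size_t)((bytes + x) - bytes);
}

/* Self-check on a local buffer; aborts if the rule does not hold. */
void check_pointer_steps(void)
{
    float buf[8] = {0};
    assert(float_step_bytes(buf, 4) == 4 * sizeof(float));
    assert(char_step_bytes(buf, 4) == 4);
}
```

On a platform where a float is 4 bytes, float_step_bytes(buf, 4) is 16, while char_step_bytes(buf, 4) is 4.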
float* row = (float*)((char*)devPtr + r * pitch);
By casting devPtr to a char*, the offset that is applied (r * pitch) is counted in 1-byte increments, because a char is one byte. Had the cast not been there, the offset applied to devPtr would be r * pitch times 4 bytes, as a float is four bytes.
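To see the row calculation in action without a GPU, here is a minimal host-side sketch. It assumes a hand-picked pitch and a plain malloc buffer standing in for what cudaMallocPitch would return on the device; row_ptr and fill_pitched are hypothetical helper names, and the pitch is assumed to be a multiple of sizeof(float) so every row stays correctly aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Start of row r in a pitched 2D float buffer.
   pitch is the row stride in BYTES, so the offset is applied to a char*. */
float *row_ptr(float *base, size_t pitch, size_t r)
{
    return (float *)((char *)base + r * pitch);
}

/* Fill a width x height pitched buffer so element (r, c) holds r * 100 + c.
   Assumes base points to at least pitch * height bytes. */
void fill_pitched(float *base, size_t pitch, size_t width, size_t height)
{
    /* Each row must be wide enough for its elements; the padding is left untouched. */
    assert(pitch >= width * sizeof(float));
    for (size_t r = 0; r < height; ++r) {
        float *row = row_ptr(base, pitch, r);
        for (size_t c = 0; c < width; ++c) {
            row[c] = (float)(r * 100 + c);
        }
    }
}
```

For example, with a pitch of 32 bytes and a width of 5 floats, each row uses 20 bytes and skips 12 bytes of padding, and row 1 starts exactly 32 bytes after the base pointer.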
For example, if we have:
float* devPtr = (float*)1000;  // pretend devPtr holds the address 1000
int r = 4;
Now, let's leave out the cast:
float* result1 = (devPtr + r);
// result1 points at address 1000 + (r * sizeof(float)) = 1016
Now, if we include the cast:
float* result2 = (float*)((char*)devPtr + r);
// result2 points at address 1000 + (r * sizeof(char)) = 1004
The cast is only there to make the pointer arithmetic work correctly:
(float*)((char*)devPtr + r * pitch);
moves r * pitch bytes forward, while
(float*)(devPtr + r * pitch);
would move r * pitch floats forward (i.e. 4 times as many bytes).