Suppose I have the following serial C:

int add(int* a, int* b, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            a[i*n + j] += b[i*n + j];
        }
    }
    return 0;
}
I am assuming you are working with an n-by-n, row-major order array. Try the following:
__global__ void calc(int *A, int *B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // row index of this thread
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // column index of this thread
    if (i < n && j < n) {                            // guard against threads outside the array
        A[i*n + j] += B[i*n + j];
    }
}
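A minimal host-side launch for that kernel might look like the sketch below; the 16x16 block size and the d_A/d_B device pointer names are illustrative assumptions, not something from your question:

// Illustrative launch: 16x16 threads per block is an assumption, and the
// grid size is rounded up so that every element of the n-by-n array is covered.
dim3 block(16, 16);
dim3 grid((n + block.x - 1) / block.x,
          (n + block.y - 1) / block.y);
calc<<<grid, block>>>(d_A, d_B, n);   // d_A, d_B are device pointers from cudaMalloc/cudaMemcpy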
The compiler warnings you are getting come from the fact that on older GPUs, the memory structure is not "flat". The compiler can't know which memory space the addresses held by the pointer arrays your kernel is working with point into, so it warns you that it is assuming the operation is being performed in global memory. If you compile the code for a Fermi card (the sm_20 or sm_21 architectures), you won't see that warning, because the memory model on those cards is "flat" and pointers are correctly interpreted by the hardware at runtime, so the compiler doesn't need to resolve the memory space at compile time.
To answer each of your questions:
Yes. And no. The overall idea is about 90% right, but there are several implementation issues which will become apparent from the answers which follow.
CUDA C has built-in variables that let each thread determine its "coordinates" in the execution grid it is running in, as well as the dimensions of each block and of the grid itself. threadIdx.{xyz} provides the thread coordinates within a block, and blockIdx.{xyz} the block coordinates within the grid. blockDim.{xyz} and gridDim.{xyz} provide the dimensions of the block and of the grid, respectively (note that not all hardware supports 3D grids). CUDA uses column-major order for numbering threads within each block and blocks within each grid. The calculation you are asking about computes the equivalent {i,j} coordinate in a 2D grid from the thread and block coordinates and the block size. This is discussed in some detail in the first few pages of the "Programming model" chapter of the CUDA programming guide.
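To make that index arithmetic concrete, here is a small self-contained sketch; the launch dimensions are purely illustrative, and the device-side printf requires an sm_20 or newer target:

#include <cstdio>

// Each thread derives its global (i, j) coordinate from its block
// coordinates, the block dimensions and its in-block thread coordinates.
__global__ void whereami(void)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global x coordinate
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // global y coordinate
    printf("block (%d,%d) thread (%d,%d) -> (i,j) = (%d,%d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, i, j);
}

int main(void)
{
    dim3 block(4, 3);              // 4 threads in x, 3 in y per block
    dim3 grid(2, 2);               // 2 blocks in x, 2 in y
    whereami<<<grid, block>>>();   // 8 x 6 threads in total
    cudaDeviceSynchronize();       // wait for the device-side printf output
    return 0;
}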
No, and I say that for two reasons.
Firstly, using arrays of pointers for memory access is not a good idea in CUDA. Two levels of pointer indirection hugely increase the latency penalty of getting to your data. The key difference between a typical GPU architecture and a modern CPU architecture is the memory system: GPUs have stunningly high peak memory bandwidth but very high access latency, whereas CPUs are designed for minimal latency. So having to read and dereference two pointers to fetch a value from memory is a very big performance penalty. Store your 2D array or matrix in linear memory instead; this is what BLAS, LAPACK and Matlab do anyway.
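As a sketch of what "linear memory" means in practice (the h_A/d_A names, the size and the lack of error checking are all just for illustration):

#include <cstdlib>

int main(void)
{
    int n = 1024;                                  // example matrix dimension
    size_t bytes = (size_t)n * n * sizeof(int);

    // One contiguous allocation holds the whole n-by-n matrix;
    // element (i, j) lives at offset i*n + j in row-major order.
    int *h_A = (int *)calloc((size_t)n * n, sizeof(int));   // host copy, linear
    int *d_A = NULL;
    cudaMalloc((void **)&d_A, bytes);              // a single device allocation, no array of pointers
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);

    // Inside a kernel, d_A[i*n + j] reaches row i, column j with one memory
    // access, instead of fetching a row pointer first and the data second.

    cudaFree(d_A);
    free(h_A);
    return 0;
}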
Secondly, every thread in your code is performing four integer arithmetic operations of setup overhead (the index calculations) for every one "productive" integer operation (the addition). There are strategies to reduce that, usually involving having each thread process more than one array element.
If I were to write a kernel for that operation, I would do it something like the code at the bottom of my answer. It uses linear memory and a 1D grid: a fixed number of threads, chosen to properly occupy the GPU, processes the whole input array, with each thread handling many elements.
No. As I mentioned earlier in my answer, Matlab uses linear memory to store matrices, not an array of pointers. This doesn't match the layout your kernel code is expecting.
Sample code:
__global__ void calc(int *A, int *B, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's starting index
    int s = blockDim.x * gridDim.x;                  // total number of threads in the grid

    for ( ; i < N; i += s) {                         // grid-stride loop: each thread handles many elements
        A[i] += B[i];
    }
}
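A possible way to launch it (the thread and block counts and the d_A/d_B names are illustrative; because of the stride loop, any launch that keeps the GPU occupied will cover all N elements):

// Illustrative launch: the block count does not need to depend on N,
// since each thread strides through the array until it runs out of work.
int threads = 256;
int blocks  = 64;                          // e.g. a handful of blocks per multiprocessor
calc<<<blocks, threads>>>(d_A, d_B, N);    // d_A, d_B are device pointers, N is the element count
cudaDeviceSynchronize();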