Let p
be a matrix of first set of locations where each row gives the coordinates of a particular point. Similarly, let q
be a matrix of second set of l
The problem looks simple enough to make a library overkill.
Without knowing the range of i
and j
, I'd suggest you partition k
into blocks of a multiple of 32 threads each and in each block, compute
float sum, myp[d];
int i = blockIdx.x*blockDim.x + threadIdx.x;
for ( int kk = 0 ; kk < d ; kk++ )
myp[kk] = p(i,kk);
for ( j = blockIdx.y*blockDim.y ; j < (blockIdx.y+1)*blockDim ; j++ ) {
sum = 0.0f;
#pragma unroll
for ( int kk = 0 ; kk < d ; kk++ ) {
temp = myp[kk] - q(j,kk);
sum += temp*temp;
}
k(i,j) = sum;
}
where I am assuming that your data has d
dimensions and writing p(i,k)
, q(j,k)
and k(i,j)
to mean an access to a two-dimensional array. I also took the liberty in assuming that your data is of type float
.
Note that depending on how k
is stored, e.g. row-major or column-major, you may want to loop over i
per thread instead to get coalesced writes to k
.