I want to parallelize a CUDA C function that counts all vectors whose elements sum to a given value and whose elements are no larger than k. For example if the number of vector ele
As Robert said in the comments, if you want to generate all (k+1)^n candidate vectors on the GPU and test them, you can use a kernel like this:
    __device__ int count; // global counter; must be set to zero before the kernel call

    __global__ void perm_generator(int k, int n, int sum) {
        // Each thread handles one vector: tid is read as an n-digit number in base (k+1)
        int tid = blockIdx.x*blockDim.x+threadIdx.x;
        int id = tid;
        int mysum = 0;
        for ( int i = n; i > 1; i-- ) { // first n-1 vector elements
            mysum += (id % (k+1));
            id /= (k+1);
        }
        mysum += id; // last element
        if ( mysum == sum ) atomicAdd( &count, 1 );
    }
The kernel should be called with exactly (k+1)^n threads. If you launch more threads than that (for example because you follow the rule of thumb that the block size should be a multiple of 32), check the value of tid at the start of the kernel and return early when tid >= (k+1)^n, otherwise the surplus threads will be counted too. Also, cudaThreadSynchronize() is deprecated. Use cudaDeviceSynchronize() instead.
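To make the launch details concrete, here is a minimal sketch of how the kernel and its host side could look with the tid check in place. It is one way to do it under some assumptions, not the only one. The names count_kernel, d_count and count_vectors_gpu are made up for this example, the block size of 256 is an arbitrary choice, and I switched the index and counter to 64-bit so that (k+1)^n does not overflow an int. It also assumes the whole search space fits in a single 1D grid.

```cuda
#include <cuda_runtime.h>

// Device-side counter (hypothetical name); zeroed from the host before each launch.
__device__ unsigned long long d_count;

// One thread per candidate vector. The thread index is read as an n-digit
// number in base (k+1); each digit is one vector element in [0, k].
__global__ void count_kernel(unsigned long long base, int n, int sum,
                             unsigned long long total)
{
    unsigned long long tid =
        (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= total) return;  // surplus threads in the last block do nothing

    unsigned long long id = tid;
    long long mysum = 0;
    for (int i = 0; i < n; ++i) {  // peel off all n digits
        mysum += (long long)(id % base);
        id /= base;
    }
    if (mysum == sum) atomicAdd(&d_count, 1ULL);
}

// Host wrapper (hypothetical): returns the number of matching vectors,
// or -1 on invalid input, a CUDA error, or a search space too large
// for a single 1D launch.
long long count_vectors_gpu(int k, int n, int sum)
{
    if (k < 0 || n < 1) return -1;

    const unsigned int threads = 256;  // assumed block size, a multiple of 32
    const unsigned long long max_total = 2147483647ULL * threads;  // grid x-dimension limit
    const unsigned long long base = (unsigned long long)k + 1;

    // total = (k+1)^n, computed with an overflow guard
    unsigned long long total = 1;
    for (int i = 0; i < n; ++i) {
        if (total > max_total / base) return -1;
        total *= base;
    }

    // Reset the counter before the kernel call
    unsigned long long zero = 0;
    if (cudaMemcpyToSymbol(d_count, &zero, sizeof(zero)) != cudaSuccess) return -1;

    // Round the grid up so that at least `total` threads are launched
    unsigned int blocks = (unsigned int)((total + threads - 1) / threads);
    count_kernel<<<blocks, threads>>>(base, n, sum, total);
    if (cudaGetLastError() != cudaSuccess) return -1;       // bad launch configuration
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;  // wait for the kernel to finish

    unsigned long long result = 0;
    if (cudaMemcpyFromSymbol(&result, d_count, sizeof(result)) != cudaSuccess) return -1;
    return (long long)result;
}
```

For a quick sanity check, count_vectors_gpu(2, 2, 2) should give 3, since the only vectors of length 2 with elements in 0..2 that sum to 2 are (0,2), (1,1) and (2,0). Keep in mind that every matching thread hits the same atomic counter, so for large search spaces a per-block reduction before the atomicAdd would probably be faster.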