I want to parallelize a function in CUDA C which will count all vectors with sum equal of vector elements and elements not bigger than k. For example if the number of vector ele
The problem is __syncthreads(). It is a block-wide barrier: for it to work properly, every thread in the block must reach it. If some threads skip it, the threads that did reach it wait forever and the kernel never returns. In your program, some of the __syncthreads() calls sit inside conditional code, so only part of the block executes them. That is why your program does not work with more than one thread per block. Move each __syncthreads() out of the branch so that all threads execute it, and keep only the actual work conditional.
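Since your full kernel is not shown here, the following is only a minimal sketch of the pattern, not a fix for your exact code. The names blockSum, sumOnes and BLOCK are made up for illustration. It sums an array per block and shows where the barrier has to sit: outside any branch that depends on the thread index.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block (assumed; must be a power of two here)

// Hypothetical kernel: each block sums its slice of `in` into out[blockIdx.x].
__global__ void blockSum(const int *in, int *out, int n)
{
    __shared__ int buf[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // WRONG: if (i < n) { buf[tid] = in[i]; __syncthreads(); }
    // Threads with i >= n would skip the barrier.

    // RIGHT: only the load is conditional; every thread reaches the barrier.
    buf[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction in shared memory. The loop bound is the same for all
    // threads, so every thread executes the same number of barriers.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();  // outside the if, executed by the whole block
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];
}

// Host helper: sums an array of n ones on the GPU; returns -1 on a CUDA error.
int sumOnes(int n)
{
    int blocks = (n + BLOCK - 1) / BLOCK;
    std::vector<int> hIn(n, 1), hOut(blocks, 0);
    int *dIn = nullptr, *dOut = nullptr;

    if (cudaMalloc(&dIn, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, blocks * sizeof(int)) != cudaSuccess) {
        cudaFree(dIn);
        return -1;
    }
    cudaMemcpy(dIn, hIn.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK>>>(dIn, dOut, n);

    // This copy waits for the kernel to finish and reports any launch error.
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, blocks * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int total = 0;
    for (int b = 0; b < blocks; ++b)
        total += hOut[b];  // add the per-block partial sums on the host
    return total;
}

int main()
{
    // 1000 is not a multiple of 256, so the last block has idle threads.
    printf("sum = %d\n", sumOnes(1000));
    return 0;
}
```

With n = 1000 the last block has threads whose index is past the end of the array. They contribute 0 to the sum but still execute every __syncthreads(), which is the point of the pattern.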