I have the following code for a CUDA program:
#include <cstdio>
#define NUM_BLOCKS 4
#define THREADS_PER_BLOCK 4
__global__ void hello()
{
    printf("Hello! I'm thread %d in block %d\n", threadIdx.x, blockIdx.x);
}
int main()
{
    hello<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>();
    cudaDeviceSynchronize();
}
However, the thread order within every block is always 0,1,2,3. Why is this happening? I thought it would be random too.
With 4 threads per block you are only launching one warp per block. A warp is the unit of execution (and scheduling, and resource assignment) in CUDA, not a thread. Currently, a warp consists of 32 threads.
This means that all 4 of your threads per block (since there is no conditional behavior in this case) are executing in lockstep. When they reach the printf call, they all execute that call at the same line of code, in lockstep.
So the question becomes: in this situation, how does the CUDA runtime dispatch these "simultaneous" function calls? The dispatch order is unspecified, but it is not "random". It is therefore not surprising that the order of dispatch for operations within a warp does not change from run to run.
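One way to see the warp membership directly (this kernel is my own sketch, not part of your code) is to have each thread print its warp and lane index, computed from the built-in warpSize; you can launch it with the same <<<NUM_BLOCKS, THREADS_PER_BLOCK>>> configuration as your hello kernel:

__global__ void whoami()
{
    // warpSize is a built-in device constant (32 on current hardware)
    int warp = threadIdx.x / warpSize;   // which warp within the block
    int lane = threadIdx.x % warpSize;   // position within that warp
    // With THREADS_PER_BLOCK = 4, every thread reports warp 0, lane 0..3,
    // i.e. all 4 threads of a block belong to the same warp.
    printf("block %d warp %d lane %d\n", blockIdx.x, warp, lane);
}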
If you launch enough threads to create multiple warps per block, and probably also include some other code to disperse and/or "randomize" the behavior between warps, you should be able to see printf operations emanating from separate warps occurring in "random" order.
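Here is a rough sketch of that idea (the kernel name, the spin loop, and the launch shape are all illustrative choices, not anything from your program): each warp busy-waits for a different length of time before printing, so the interleaving of output between warps can vary from run to run.

#include <cstdio>
__global__ void multiwarp()
{
    int warp = threadIdx.x / warpSize;
    // Spin each warp for a different amount of time so the warps reach
    // the printf at different, run-dependent moments.
    long long start = clock64();
    while (clock64() - start < warp * 100000) { }
    printf("block %d warp %d thread %d\n", blockIdx.x, warp, threadIdx.x);
}
int main()
{
    // 128 threads per block = 4 warps per block, across 2 blocks
    multiwarp<<<2, 128>>>();
    cudaDeviceSynchronize();
}

Within any single warp the printf lines will still come out in lane order; it is the ordering between warps (and between blocks) that you should see shuffle around.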
To answer the second part of your question: when control flow diverges at the if statement, the threads where threadIdx.x != 0 simply wait at the convergence point after the if statement. They do not go on to the printf statement until thread 0 has completed the if block.
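In other words, for a kernel shaped roughly like the following (my reconstruction of what your second kernel presumably looks like, launched the same way as the hello kernel above), the final printf still comes out in the order 0,1,2,3 within each block:

__global__ void diverge()
{
    if (threadIdx.x == 0) {
        // Only thread 0 is active here; threads 1-3 of the warp are masked
        // off and wait at the reconvergence point after the if block.
        printf("block %d: thread 0 inside the if\n", blockIdx.x);
    }
    // All threads of the warp reconverge here and execute this printf in
    // lockstep again, so the order within each block is again 0,1,2,3.
    printf("block %d thread %d after the if\n", blockIdx.x, threadIdx.x);
}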