I am reading Professional CUDA C Programming, and in the GPU Architecture Overview section it says:
CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps.
There is no contradiction. All threads in a warp execute the same instruction in lock-step at all times. To support conditional execution and branching, CUDA introduces two concepts in the SIMT model: predicated execution and instruction replay.
Predicated execution means that the result of a conditional instruction can be used to mask off threads from executing a subsequent instruction, without a branch. Instruction replay is how a classic conditional branch is handled: all threads in the warp execute every taken branch of the conditional code by replaying the instructions, and threads that do not follow a particular execution path are masked off and execute the equivalent of a NOP. This replay is the source of the so-called branch divergence penalty in CUDA, which can have a significant impact on performance.
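To make instruction replay concrete, here is a minimal sketch (my own example, not from the book) of a kernel whose branch splits every warp: even and odd lanes take different paths, so the warp runs both paths with the inactive lanes masked off. The names `divergentKernel` and `d_out` are purely illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every warp contains both even and odd lanes, so this branch diverges:
// the warp executes the "if" body with odd lanes masked, then the "else"
// body with even lanes masked.
__global__ void divergentKernel(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid % 2 == 0) {
        // Even lanes are active here; odd lanes execute the equivalent of a NOP.
        out[tid] = 100.0f;
    } else {
        // Odd lanes are active here; even lanes are masked off.
        out[tid] = 200.0f;
    }
}

int main()
{
    const int n = 64;               // two warps of 32 threads
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    divergentKernel<<<1, n>>>(d_out);
    cudaDeviceSynchronize();

    float h_out[n];
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %.1f, out[1] = %.1f\n", h_out[0], h_out[1]);

    cudaFree(d_out);
    return 0;
}
```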
This is how lock-step execution can support branching.
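For predicated execution, a short conditional assignment like the sketch below can typically be compiled to a predicated instruction (or a select) instead of an actual branch, so no replay is needed even though lanes disagree on the condition. Whether the compiler actually predicates it is its decision; you can check by inspecting the generated SASS with cuobjdump. `predicatedKernel` is again just an illustrative name.

```cuda
// Minimal sketch: a short conditional assignment that the compiler can
// usually implement with predication or a select rather than a branch.
__global__ void predicatedKernel(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float v = in[tid];
        // Every lane evaluates the comparison; the choice of value to store
        // can be predicated on the result, with no divergent branch.
        out[tid] = (v > 0.0f) ? v : 0.0f;
    }
}
```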