Branch and predicated instructions

隐瞒了意图╮ 2020-12-15 22:49

Section 5.4.2 of the CUDA C Programming Guide states that branch divergence is handled either by "branch instructions" or, under certain conditions, "predicated instructi

1 Answer
  • 2020-12-15 23:09

    Instruction predication means that an instruction is conditionally executed by a thread depending on a predicate. Threads for which the predicate is true execute the instruction, the rest do nothing.

    For example:

    var = 0;
    
    // Not taken by all threads
    if (condition) {
        var = 1;
    } else {
        var = 2;
    }
    
    output = var;
    

    This would result in something like the following (not actual compiler output):

           mov.s32 var, 0;       // Executed by all threads.
           setp pred, condition; // Executed by all threads, sets predicate.
    
    @pred  mov.s32 var, 1;       // Executed only by threads where pred is true.
    @!pred mov.s32 var, 2;       // Executed only by threads where pred is false.
           mov.s32 output, var;  // Executed by all threads.
    

    All in all, that's 3 instructions for the if (one setp and two predicated movs) and no branching. Very efficient.
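
To make the "every thread sees the instruction, only some commit it" idea concrete, here is a minimal host-side C++ sketch of a 32-lane warp running the predicated version. It is a toy model, not how the hardware is implemented; the names `kWarpSize`, `Warp`, `Mask`, `predicated_mov` and `run_predicated` are all hypothetical and introduced only for illustration.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;     // assumed warp width for this toy model
using Warp = std::array<int, kWarpSize>;  // one register value per lane
using Mask = std::array<bool, kWarpSize>; // one predicate bit per lane

// Toy model of one predicated "mov": the instruction is issued once for the
// whole warp, but only lanes whose predicate equals `want` write the result.
void predicated_mov(Warp& dst, int value, const Mask& pred, bool want) {
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        if (pred[lane] == want) {
            dst[lane] = value;
        }
    }
}

// Predicated version of: if (condition) var = 1; else var = 2;
// Returns the number of instructions issued for the if in this model.
int run_predicated(const Mask& condition, Warp& var) {
    int issued = 0;
    Mask pred = condition;               ++issued;  // setp
    predicated_mov(var, 1, pred, true);  ++issued;  // @pred  mov
    predicated_mov(var, 2, pred, false); ++issued;  // @!pred mov
    return issued;
}
```

In this model the count is always 3, no matter how the lanes split between true and false, which is the property the predicated listing above relies on.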

    The equivalent code with branches would look like:

           mov.s32 var, 0;       // Executed by all threads.
           setp pred, condition; // Executed by all threads, sets predicate.
    
    @!pred bra IF_FALSE;         // Conditional branches are predicated instructions.
    IF_TRUE:                    // Label for clarity, not actually used.
           mov.s32 var, 1;
           bra IF_END;
    IF_FALSE:
           mov.s32 var, 2;
    IF_END:
           mov.s32 output, var;
    

    Notice how much longer it is (5 instructions for the if). The conditional branch requires disabling part of the warp, executing the first path, then rolling back to the point where the warp diverged and executing the second path until both converge. That takes longer and requires extra bookkeeping and more code loading (particularly when there are many instructions to execute), and hence more memory requests. All of that makes branching slower than simple predication.
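
The divergence handling described above can be sketched the same way on the host. The following C++ is a simplified model under stated assumptions: it serializes the two paths with per-lane active masks and counts one issue per instruction, ignoring the real scheduler, the reconvergence mechanism and instruction fetch. All names (`kWarpSize`, `Warp`, `Mask`, `any_lane`, `masked_mov`, `run_branching`) are hypothetical.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;     // assumed warp width for this toy model
using Warp = std::array<int, kWarpSize>;  // one register value per lane
using Mask = std::array<bool, kWarpSize>; // one active bit per lane

// True if at least one lane is enabled in the mask.
bool any_lane(const Mask& mask) {
    for (bool enabled : mask) {
        if (enabled) return true;
    }
    return false;
}

// Write `value` only in the lanes that are currently enabled.
void masked_mov(Warp& dst, int value, const Mask& active) {
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        if (active[lane]) dst[lane] = value;
    }
}

// Branching version of: if (condition) var = 1; else var = 2;
// Returns the number of instructions issued for the if in this model.
int run_branching(const Mask& condition, Warp& var) {
    int issued = 0;
    Mask taken = condition;                ++issued;  // setp
    Mask not_taken{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        not_taken[lane] = !condition[lane];
    }
    ++issued;                                         // @!pred bra IF_FALSE

    if (any_lane(taken)) {                 // IF_TRUE path, other lanes disabled
        masked_mov(var, 1, taken);         ++issued;  // mov var, 1
        ++issued;                                     // bra IF_END
    }
    if (any_lane(not_taken)) {             // IF_FALSE path, other lanes disabled
        masked_mov(var, 2, not_taken);     ++issued;  // mov var, 2
    }
    return issued;                         // IF_END: lanes reconverge here
}
```

For a warp whose lanes disagree on the condition, this model issues 5 instructions, matching the count above. When every lane agrees, one of the two paths is never entered, so the cost described here applies only to warps that actually diverge.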

    And actually, in the case of this very simple conditional assignment, the compiler can do even better, with only 2 instructions for the if:

    mov.s32 var, 0;       // Executed by all threads.
    setp pred, condition; // Executed by all threads, sets predicate.
    selp.s32 var, 1, 2, pred; // Sets var depending on predicate (true: 1, false: 2).
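
At the source level, the select form corresponds to writing the assignment as a single conditional expression. The sketch below shows the two equivalent spellings in plain C++; the function names are hypothetical, and whether a particular compiler emits a select instruction for either of them is an assumption, not a guarantee.

```cpp
// Original shape: default value, then an if/else that overwrites it.
int assign_with_if(bool condition) {
    int var = 0;
    if (condition) {
        var = 1;
    } else {
        var = 2;
    }
    return var;
}

// Select shape: one expression picks between the two constants,
// mirroring "true: 1, false: 2" in the select listing.
int assign_with_select(bool condition) {
    return condition ? 1 : 2;
}
```

Both functions return the same value for every input, so an optimizing compiler is free to treat them identically.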
    