Can 1 CUDA-core to process more than 1 float-point-instruction per clock (Maxwell)?

后端 未结 1 1057
抹茶落季
抹茶落季 2021-01-03 08:47

List of Nvidia GPU - GeForce 900 Series - there is written that:

4 Single precision performance is calculated as 2 times the number

相关标签:
1条回答
  • 2021-01-03 09:33

    Summary: One FMA counts as 2 FLOPs in the standard accounting of FP throughput, even on machines that do it in a single instruction for a single execution unit (which is how it avoids intermediate rounding, the fused part of FMA).


    A CUDA "core" (also called SP - streaming processor) is most commonly referring to the single-precision floating point units in an SM (streaming multiprocessor). A CUDA core can initiate one single precision floating point instruction per clock cycle. (The unit is pipelined, so it can initiate one instruction per clock, and it can retire one instruction per clock, but it cannot fully process a given instruction in a given clock cycle.)

    If that instruction is for example, a single-precision add or single precision multiply, then that core can contribute one floating point operation per clock, since an add or multiply counts as one floating point operation. If, on the other hand, the instruction is an FMA instruction (floating point multiply-add) then the core will perform both a floating point multiply AND a floating point add operation in the same time period. This means that effectively two operations are performed by a single instruction. This usage of FMA gives rise to the 2 multiplier when computing peak theoretical throughput.

    So a core can only process (i.e. initiate, retire) a single instruction per clock, but if that instruction is an FMA, it counts as two floating point operations.

    0 讨论(0)
提交回复
热议问题