Efficiency of CUDA scalar and SIMD video instructions

Question

The throughput of the SIMD video instructions is lower than that of 32-bit integer arithmetic. On SM 2.0 (which has only the scalar versions of the video instructions) it is 2 times lower. On SM 3.0 it is 6 times lower.

In which cases is it worthwhile to use them?

Answer 1:

If your data is already packed in a format that a SIMD video instruction handles natively, then processing it with ordinary instructions would require multiple extra steps to unpack it first.

Furthermore, the throughput of a SIMD video instruction should also be multiplied by the number of actual operations performed when comparing it with ordinary arithmetic operations.

For example, the instruction vadd4 performs 4 integer adds on a packed 32-bit quantity (four 8-bit integer quantities). To duplicate this behavior with ordinary integer adds, a fairly complicated sequence of instructions would be needed to unpack the data into 4 int quantities, do 4 integer adds, and then re-pack the result. If you attempted to do it with a single 32-bit integer add, the carry out of one byte could corrupt the neighboring byte's result. vadd4 also offers clamping (saturation) and other behavior not available with an ordinary integer add.
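To make the carry problem concrete, here is a minimal host-side C++ sketch (no GPU needed) that emulates the unpack / add / re-pack sequence. The function names add_plain, add4_unpacked and add4_saturate are made up for illustration. They only approximate what a packed byte add does, and on the device one would presumably reach for the CUDA SIMD intrinsics such as __vadd4 (wrap-around) or __vaddus4 (unsigned saturation) instead of writing this by hand.

```cpp
#include <cstdint>

// Ordinary 32-bit add: a carry out of one byte spills into the next byte.
inline std::uint32_t add_plain(std::uint32_t a, std::uint32_t b) {
    return a + b;
}

// Hypothetical emulation of a packed 4 x 8-bit add with wrap-around:
// unpack each byte lane, add, mask back to 8 bits, and re-pack.
inline std::uint32_t add4_unpacked(std::uint32_t a, std::uint32_t b) {
    std::uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const int shift = 8 * lane;
        const std::uint32_t x = (a >> shift) & 0xFFu;  // unpack lane of a
        const std::uint32_t y = (b >> shift) & 0xFFu;  // unpack lane of b
        const std::uint32_t sum = (x + y) & 0xFFu;     // drop the carry
        result |= sum << shift;                        // re-pack
    }
    return result;
}

// Same idea, but each lane clamps at 255 instead of wrapping
// (unsigned saturation).
inline std::uint32_t add4_saturate(std::uint32_t a, std::uint32_t b) {
    std::uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const int shift = 8 * lane;
        std::uint32_t sum = ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu);
        if (sum > 0xFFu) sum = 0xFFu;                  // clamp the lane
        result |= sum << shift;
    }
    return result;
}
```

With a = 0x01FF01FF and b = 0x01010101, add_plain returns 0x03000300 because the carries out of the 0xFF bytes leak into their neighbors, add4_unpacked returns 0x02000200, and add4_saturate returns 0x02FF02FF. Each emulated version costs several shifts, masks and adds per lane, which is the overhead a single packed instruction avoids.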

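In the case of SM 2.0, the ratio of 4 operations performed by one vadd4 vs. the 4 integer adds necessary on unpacked data is enough to make it attractive: at half the throughput of an integer add (the figure from the question), it still delivers about twice as many byte adds per unit time. In the case of SM 3.0, the same arithmetic at one-sixth the throughput gives only about two-thirds the rate of plain adds, but once the unpacking and packing are added to the ordinary integer add routine, vadd4 looks attractive again. The situation becomes even more attractive with compute capability (cc) 5.0.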

Source: https://stackoverflow.com/questions/24634943/efficiency-of-cuda-scalar-and-simd-video-instructions

Tags

cuda

nvidia

simd