I want a measure of how much of the peak performance my kernel achieves.
Say I have an NVIDIA Tesla C1060, which has a peak GFLOPS of 622.08 (~= 240Cores * 1300MHz * 2).
Nsight VSE (>3.2) and the Visual Profiler (>=5.5) support Achieved FLOPS calculation. In order to collect the metric, the profilers run the kernel twice (using kernel replay). In the first replay, the number of floating point instructions executed is collected (with understanding of predication and active mask). In the second replay, the duration is collected.
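To make the two replays concrete, here is a minimal sketch of the bookkeeping, not the profilers' actual implementation. It assumes each warp-level floating point instruction is recorded as a pair of 32-bit masks (active mask, predicate mask), that a set bit means the thread executes, and that every instruction counts as one operation. The function names and the trace are hypothetical.

```python
def thread_level_count(warp_issues):
    """First replay (sketch): count thread-level executions.

    warp_issues is an iterable of (active_mask, predicate_mask) pairs,
    one pair per warp-level floating point instruction issued.
    A thread contributes only if it is active AND its predicate is true.
    """
    return sum(bin(active & pred & 0xFFFFFFFF).count("1")
               for active, pred in warp_issues)


def achieved_gflops(flop_count, duration_s):
    """Second replay (sketch): divide the count by the measured duration."""
    return flop_count / duration_s / 1e9


# Hypothetical trace: a full warp, a warp with half its threads inactive,
# and a full warp with four threads predicated off.
issues = [
    (0xFFFFFFFF, 0xFFFFFFFF),  # 32 threads execute
    (0x0000FFFF, 0xFFFFFFFF),  # 16 threads execute
    (0xFFFFFFFF, 0xFFFFFFF0),  # 28 threads execute
]
count = thread_level_count(issues)  # 32 + 16 + 28 = 76
print(count)
```

The point of the sketch is that the count is per thread, not per warp instruction, so divergent or predicated code reports fewer operations than the warp-level instruction count would suggest.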
nvprof and Visual Profiler have a hardcoded definition: FMA counts as 2 operations and every other operation counts as 1. The flops_sp_* counters are thread instruction execution counts, whereas flops_sp is the weighted sum, so a different weighting can be applied using the individual metrics. However, flops_sp_special covers a number of different instructions, so those cannot be weighted individually.
The Nsight VSE experiment configuration allows the user to define the operations per instruction type.
Configuring to collect Achieved FLOPS
Viewing Achieved FLOPS
Metrics Available (on a K20)
nvprof --query-metrics | grep flop
flops_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)
flops_sp_add: Number of single-precision floating-point add operations executed by non-predicated threads
flops_sp_mul: Number of single-precision floating-point multiply operations executed by non-predicated threads
flops_sp_fma: Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads
flops_dp: Number of double-precision floating-point operations executed non-predicated threads (add, multiply, multiply-accumulate and special)
flops_dp_add: Number of double-precision floating-point add operations executed by non-predicated threads
flops_dp_mul: Number of double-precision floating-point multiply operations executed by non-predicated threads
flops_dp_fma: Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads
flops_sp_special: Number of single-precision floating-point special operations executed by non-predicated threads
flop_sp_efficiency: Ratio of achieved to peak single-precision floating-point operations
flop_dp_efficiency: Ratio of achieved to peak double-precision floating-point operations
Collection and Results
nvprof --devices 0 --metrics flops_sp --metrics flops_sp_add --metrics flops_sp_mul --metrics flops_sp_fma matrixMul.exe
[Matrix Multiply Using CUDA] - Starting...
==2452== NVPROF is profiling process 2452, command: matrixMul.exe
GPU Device 0: "Tesla K20c" with compute capability 3.5
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 6.18 GFlop/s, Time= 21.196 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK
Note: For peak performance, please refer to the matrixMulCUBLAS example.
==2452== Profiling application: matrixMul.exe
==2452== Profiling result:
==2452== Metric result:
Invocations  Metric Name   Metric Description        Min        Max        Avg
Device "Tesla K20c (0)"
    Kernel: void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
        301  flops_sp      FLOPS(Single)       131072000  131072000  131072000
        301  flops_sp_add  FLOPS(Single Add)           0          0          0
        301  flops_sp_mul  FLOPS(Single Mul)           0          0          0
        301  flops_sp_fma  FLOPS(Single FMA)    65536000   65536000   65536000
NOTE: flops_sp = flops_sp_add + flops_sp_mul + flops_sp_special + (2 * flops_sp_fma) (approximately)
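As a quick check of the NOTE against the numbers above, the weighted sum can be sketched as follows. The default weights follow the hardcoded nvprof definition described earlier (FMA = 2, everything else = 1). flops_sp_special was not collected in this run, so it is assumed to be 0 here. The function name, the dictionary keys, and the alternative weights are illustrative, not part of any tool.

```python
# Hardcoded nvprof/Visual Profiler weighting: FMA is 2 operations, the rest 1.
DEFAULT_WEIGHTS = {"add": 1, "mul": 1, "fma": 2, "special": 1}


def weighted_flops(counts, weights=DEFAULT_WEIGHTS):
    """Weighted sum of per-type thread instruction counts."""
    return sum(weights[kind] * counts.get(kind, 0) for kind in weights)


# Per-invocation values from the matrixMul run above.
# "special" is absent because flops_sp_special was not collected (assumed 0).
reported = {"add": 0, "mul": 0, "fma": 65536000}

print(weighted_flops(reported))  # 131072000, matching the reported flops_sp

# A different weighting (hypothetical: count special operations as 4 each)
# can be applied to the same counters.
custom = dict(DEFAULT_WEIGHTS, special=4)
print(weighted_flops(reported, custom))  # still 131072000, no special ops here
```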
The Visual Profiler supports the metrics shown in the nvprof section above.
First some general remarks:
In general, what you are doing is mostly an exercise in futility and is the reverse of how most people would probably go about performance analysis.
The first point to make is that the peak value you are quoting is strictly for floating point multiply-add instructions (FMAD), which count as two FLOPS, and can be retired at a maximum rate of one per cycle. Other floating point operations which retire at a maximum rate of one per cycle would formally only be classified as a single FLOP, while others might require many cycles to be retired. So if you decide to quote kernel performance against that peak, you are really comparing your code's performance against a stream of pure FMAD instructions, and nothing more than that.
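A rough sketch of that arithmetic, under stated assumptions: the C1060's shader clock is taken as 1296 MHz (which the question rounds to 1300MHz), every instruction in the mix is assumed to retire at one per cycle per core, and slower multi-cycle operations are ignored. The function names are made up for illustration.

```python
def peak_gflops(cores, clock_mhz, flops_per_instruction=2.0):
    """Peak rate if every core retires one instruction per cycle."""
    return cores * clock_mhz * flops_per_instruction / 1000.0


def mix_ceiling_gflops(cores, clock_mhz, fma_fraction):
    """Ceiling for a mix of FMAD (2 FLOPs) and single-FLOP instructions.

    fma_fraction is the share of instructions that are FMAD; the rest are
    assumed to be one-per-cycle operations worth a single FLOP each.
    """
    average_flops = 2.0 * fma_fraction + 1.0 * (1.0 - fma_fraction)
    return peak_gflops(cores, clock_mhz, average_flops)


print(peak_gflops(240, 1296))              # the quoted pure-FMAD peak
print(mix_ceiling_gflops(240, 1296, 0.5))  # half FMAD, half add/mul
print(mix_ceiling_gflops(240, 1296, 0.0))  # no FMAD at all
```

Even under these generous assumptions, a kernel containing no FMAD instructions cannot exceed half of the quoted peak, which is why a comparison against that number says little about code quality.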
The second point is that when researchers quote FLOP/s values from a piece of code, they are usually using a model FLOP count for the operation, not trying to count instructions. Matrix multiplication and the Linpack LU factorization benchmarks are classic examples of this approach to performance benchmarking. The lower bound of the operation count of those calculations is exactly known, so the calculated throughput is simply that lower bound divided by the time. The actual instruction count is irrelevant. Programmers often use all sorts of techniques, including redundant calculations, speculative or predictive calculations, and a host of other ideas to make code run faster. The actual FLOP count of such code is irrelevant; the reference is always the model FLOP count.
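For illustration, here is a minimal sketch of the model-count approach for a dense matrix multiplication, using the conventional count of 2*m*n*k operations (one multiply and one add per inner-product term). The dimensions and the time are the ones printed by the matrixMul sample run quoted elsewhere in this thread; the function names are hypothetical.

```python
def matmul_model_flops(m, n, k):
    """Model FLOP count for C (m x n) = A (m x k) * B (k x n)."""
    return 2 * m * n * k


def model_gflops(model_flops, seconds):
    """Throughput against the model count, whatever the kernel executed."""
    return model_flops / seconds / 1e9


# matrixMul sample: MatrixA(320,320), MatrixB(640,320), Time= 21.196 msec
flops = matmul_model_flops(320, 640, 320)
rate = model_gflops(flops, 21.196e-3)
print(f"{flops} model FLOPs, {rate:.2f} GFlop/s")
# prints "131072000 model FLOPs, 6.18 GFlop/s"
```

The result matches the "Size= 131072000 Ops" and "Performance= 6.18 GFlop/s" lines the sample prints, which suggests the sample reports its performance from a model count of this kind rather than from counted instructions.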
Finally, when looking at quantifying performance, there are usually only two points of comparison of any real interest
In the first case you really only need to measure execution time. In the second, a suitable measure usually isn't FLOP/s, it is useful operations per unit time (records per second in sorting, cells per second in a fluid mechanical simulation, etc). Sometimes, as mentioned above, the useful operations can be the model FLOP count of an operation of known theoretical complexity. But the actual floating point instruction count rarely, if ever, enters into the analysis.
If your interest is really about optimization and understanding the performance of your code, then maybe this presentation by Paulius Micikevicius from NVIDIA might be of interest.
Addressing the bullet point questions:
Is this approach correct?
Strictly speaking, no. If you are counting floating point operations, you would need to know the exact FLOP count from the code the GPU is running. The sqrt operation can consume a lot more than a single FLOP, depending on its implementation and the characteristics of the number it is operating on, for example. The compiler can also perform a lot of optimizations which might change the actual operation/instruction count. The only way to get a truly accurate count would be to disassemble the compiled code and count the individual floating point operations, perhaps even requiring assumptions about the characteristics of the values the code will compute.
What about comparisons (if(a>b) then....)? Do I have to consider them as well?
They are not floating point multiply-add operations, so no.
Can I use the CUDA profiler for easier and more accurate results? I tried the instructions counter, but I could not figure out what the figure means.
Not really. The profiler can't differentiate between a floating point instruction and any other type of instruction, so (as of 2011) getting a FLOP count from a piece of code via the profiler is not possible. [EDIT: see Greg's excellent answer for a discussion of the FLOP counting facilities available in versions of the profiling tools released since this answer was written]