I want to get sampling data per instruction. It turns out that such a tool is a little difficult to find.
The image below is a good example from NVIDIA Nsight Compute for