ptxas

Interpreting the verbose output of ptxas, part II

[亡魂溺海] posted on 2019-12-11 05:24:44
Question: This question is a continuation of Interpreting the verbose output of ptxas, part I. When we compile a kernel .ptx file with ptxas -v, or compile it from a .cu file with -ptxas-options=-v, we get a few lines of output such as:
    ptxas info : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
    ptxas info : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
    72 bytes stack frame, 0 bytes spill stores,

Interpreting the verbose output of ptxas, part I

北慕城南 posted on 2019-11-28 21:22:12
I am trying to understand resource usage for each of my CUDA threads for a hand-written kernel. I compiled my kernel.cu file to a kernel.o file with nvcc -arch=sm_20 -ptxas-options=-v and I got the following output (passed through c++filt):
    ptxas info : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
    ptxas info : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info : Used 46 registers, 176 bytes cmem[0], 16 bytes cmem[14]
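As a complement to reading the compiler output, some per-kernel resource figures can also be queried at run time. The following is a minimal sketch, assuming a stand-in kernel named scaleKernel and a helper named queryKernelResources (both hypothetical, not from the question). It uses the CUDA runtime call cudaFuncGetAttributes. The reported fields are related to, but not guaranteed to match one to one, the lines printed by ptxas: numRegs is the register count per thread, localSizeBytes is per-thread local memory (where stack frames and spills are expected to live), and constSizeBytes covers user-allocated constant memory rather than each individual cmem bank.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so that there is something to inspect.
__global__ void scaleKernel(float *out, const float *in, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Hypothetical helper: fills attr with the resource usage recorded for scaleKernel.
cudaError_t queryKernelResources(cudaFuncAttributes *attr) {
    return cudaFuncGetAttributes(attr, scaleKernel);
}

// Prints the figures that are closest in spirit to the ptxas -v summary.
void printKernelResources() {
    cudaFuncAttributes attr;
    cudaError_t err = queryKernelResources(&attr);
    if (err != cudaSuccess) {
        std::printf("query failed: %s\n", cudaGetErrorString(err));
        return;
    }
    std::printf("registers per thread : %d\n", attr.numRegs);
    std::printf("local memory (bytes) : %zu\n", attr.localSizeBytes);
    std::printf("const memory (bytes) : %zu\n", attr.constSizeBytes);
    std::printf("static shared (bytes): %zu\n", attr.sharedSizeBytes);
    std::printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
}
```

Calling printKernelResources() from main on a machine with a CUDA device prints the values for the architecture the binary was actually built for, so they can differ from an sm_20 build such as the one quoted above.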

Interpreting the verbose output of ptxas, part I

本小妞迷上赌 posted on 2019-11-27 13:45:44
Question: I am trying to understand resource usage for each of my CUDA threads for a hand-written kernel. I compiled my kernel.cu file to a kernel.o file with nvcc -arch=sm_20 -ptxas-options=-v and I got the following output (passed through c++filt):
    ptxas info : Compiling entry function 'searchkernel(octree, int*, double, int, double*, double*, double*)' for 'sm_20'
    ptxas info : Function properties for searchkernel(octree, int*, double, int, double*, double*, double*)
    72 bytes stack frame, 0 bytes

How can I implement a custom atomic function involving several variables?

萝らか妹 posted on 2019-11-26 06:46:26
Question: I'd like to implement this atomic function in CUDA:
    __device__ float lowest; // global var
    __device__ int lowIdx; // global var
    float realNum; // thread reg var
    int index; // thread reg var
    if (realNum < lowest) {
        lowest = realNum; // the new lowest
        lowIdx = index; // update the 'low' index
    }
I don't believe I can do this with any of the atomic functions. I need to lock down a couple of global memory locations for a couple of instructions. Might I be able to implement this with PTXAS (assembly) code?
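One common way to approach an update like the one described above, without locks or hand-written assembly, is to pack both variables into a single 64-bit word and update it with a compare-and-swap loop. The sketch below is only an illustration under that assumption and is not presented as the accepted answer to the question. The names lowestPacked, packPair, unpackPair, atomicUpdateLowest, findLowest and findLowestOnDevice are hypothetical; atomicCAS on unsigned long long is the standard CUDA intrinsic. The host helper assumes a non-empty input.

```cuda
#include <cfloat>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Packed state: high 32 bits hold the float's bit pattern, low 32 bits the index.
__device__ unsigned long long lowestPacked;

__host__ __device__ inline unsigned long long packPair(float value, int idx) {
    unsigned int bits;
    memcpy(&bits, &value, sizeof(bits));
    return ((unsigned long long)bits << 32) | (unsigned int)idx;
}

__host__ __device__ inline void unpackPair(unsigned long long p, float *value, int *idx) {
    unsigned int bits = (unsigned int)(p >> 32);
    memcpy(value, &bits, sizeof(bits));
    *idx = (int)(unsigned int)(p & 0xffffffffull);
}

// Installs (realNum, index) only if realNum is lower than the stored value.
__device__ void atomicUpdateLowest(unsigned long long *addr, float realNum, int index) {
    unsigned long long observed = *addr;
    while (true) {
        float curVal;
        int curIdx;
        unpackPair(observed, &curVal, &curIdx);
        if (!(realNum < curVal)) return;             // stored value is already as low
        unsigned long long desired = packPair(realNum, index);
        unsigned long long prev = atomicCAS(addr, observed, desired);
        if (prev == observed) return;                // both fields were swapped in together
        observed = prev;                             // another thread won; re-check
    }
}

__global__ void findLowest(const float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicUpdateLowest(&lowestPacked, data[i], i);
}

// Host-side driver: returns false on an empty input or a CUDA error.
bool findLowestOnDevice(const std::vector<float> &host, float *outVal, int *outIdx) {
    if (host.empty()) return false;
    int n = (int)host.size();
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    unsigned long long init = packPair(FLT_MAX, -1);  // sentinel: no element seen yet
    cudaMemcpyToSymbol(lowestPacked, &init, sizeof(init));
    int threads = 128;
    findLowest<<<(n + threads - 1) / threads, threads>>>(dev, n);
    unsigned long long result = 0;
    cudaError_t err = cudaMemcpyFromSymbol(&result, lowestPacked, sizeof(result));
    cudaFree(dev);
    if (err != cudaSuccess) return false;
    unpackPair(result, outVal, outIdx);
    return true;
}
```

The design choice here is that a single 64-bit compare-and-swap replaces the two separate stores, so no other thread can observe a new lowest value paired with a stale index. When several elements share the minimum value, which index ends up stored depends on thread scheduling.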