PTX

Some intrinsics have `_sync()` appended to their names in CUDA 9; are the semantics the same?

天涯浪子 submitted on 2020-01-01 14:44:07
Question: In CUDA 9, NVIDIA seems to have introduced this new notion of "cooperative groups", and for some reason not entirely clear to me, __ballot() is now (as of CUDA 9) deprecated in favor of __ballot_sync(). Is that an alias, or have the semantics changed? ... similar question for other builtins which now have _sync() added to their names. Answer 1: No, the semantics are not the same. The function calls themselves are different, one is not an alias for the other, new functionality has been exposed, and the
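For illustration, a minimal sketch of the key semantic difference (the kernel and values here are hypothetical, not taken from the answer): the `_sync()` variants take an explicit 32-bit member mask naming which lanes must participate, instead of implicitly operating on whatever warp lanes happen to be converged.

```cuda
#include <cstdio>

__global__ void ballot_demo()
{
    // Pre-CUDA 9 (now deprecated): implicitly used the active threads.
    //   unsigned vote = __ballot(threadIdx.x % 2);

    // CUDA 9+: the first argument is an explicit mask of participating
    // lanes; 0xFFFFFFFF would mean "the entire warp must take part".
    unsigned mask = __activemask();                 // currently active lanes
    unsigned vote = __ballot_sync(mask, threadIdx.x % 2);
    if (threadIdx.x == 0)
        printf("ballot = 0x%08x\n", vote);
}

int main()
{
    ballot_demo<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```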

How to find the active SMs?

风流意气都作罢 submitted on 2019-12-31 05:05:46
Question: Is there any way by which I can know the number of free/active SMs? Or at least read the voltage/power or temperature values of each SM, by which I can know whether it is working or not (in real time, while some job is executing on the GPU device)? %smid helped me in knowing the ID of each SM. Something similar would be helpful. Thanks and regards, Rakesh. Answer 1: The CUDA Profiling Tools Interface (CUPTI) contains an Events API that enables run-time sampling of GPU PM counters. The CUPTI
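As a small complement to the %smid hint in the question (this is not the CUPTI Events approach the answer goes on to describe; the kernel and function names are illustrative), the special register can be read from inline PTX:

```cuda
#include <cstdio>

// Read the id of the SM on which the calling thread is running.
__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void report_smid()
{
    if (threadIdx.x == 0)
        printf("block %d runs on SM %u\n", blockIdx.x, get_smid());
}

int main()
{
    report_smid<<<8, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

This tells you which SMs your own blocks land on, which is weaker than the asked-for occupancy or health data; for actual utilization counters, the CUPTI route in the answer is the supported path.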

Detecting the PTX kernel of a Thrust transform

笑着哭i submitted on 2019-12-29 02:09:31
Question: I have the following thrust::transform call. my_functor *f_1 = new my_functor(); thrust::transform(data.begin(), data.end(), data.begin(), *f_1); I want to detect its corresponding kernel in the PTX file. But there are many kernels containing my_functor in their mangled names. For example: _ZN6thrust6system4cuda6detail6detail23launch_closure_by_valueINS2_17for_each_n_detail18for_each_n_closureINS_12zip_iteratorINS_5tupleINS_6detail15normal_iteratorINS_10device_ptrIiEEEESD_NS_9null_typeESE_SE_SE_SE_SE
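One way to narrow the search, sketched below under the assumption of a minimal reproduction file (the file name and build commands are illustrative): emit PTX for a stripped-down translation unit and demangle the .entry names, so each remaining kernel can be matched back to the Thrust internals that invoke the functor.

```cuda
// demo.cu -- minimal reproduction; emit PTX with:
//   nvcc -ptx demo.cu -o demo.ptx
// then demangle the kernel (.entry) names to find the transform kernel:
//   grep -o '_Z[[:alnum:]_]*' demo.ptx | c++filt | sort -u
#include <thrust/device_vector.h>
#include <thrust/transform.h>

struct my_functor
{
    __host__ __device__ int operator()(int x) const { return x + 1; }
};

int main()
{
    thrust::device_vector<int> data(16, 0);
    thrust::transform(data.begin(), data.end(), data.begin(), my_functor());
    return 0;
}
```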

A method of counting floating-point operations in a C++/CUDA program using PTX

ぐ巨炮叔叔 submitted on 2019-12-25 02:57:09
Question: I have a somewhat large CUDA application and I need to calculate the attained GFLOPs. I'm looking for an easy and perhaps generic way of counting the number of floating-point operations. Is it possible to count floating-point operations from the generated PTX code (as shown below), using a list of predefined floating-point operations in assembly language? Based on the code, can the counting be made generic? For example, does add.s32 %r58, %r8, -2; count as one floating-point operation? EXAMPLE: BB3_2: .loc 2 108 1
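To make the question's premise concrete: add.s32 is an integer add and contributes no floating-point operations; the FP candidates are opcodes such as add.f32, mul.f32, and fma (an fma is conventionally counted as two FLOPs). Below is a sketch of the "list of predefined operations" idea; the opcode table is deliberately partial and illustrative, and the result is a static instruction count, not the dynamically executed FLOPs, which depend on trip counts and branching.

```cuda
#include <fstream>
#include <iostream>
#include <map>
#include <string>

// Count single-precision floating-point instructions in a PTX file.
// Static count only: each instruction is tallied once, regardless of
// how many times it executes at run time.
int main(int argc, char **argv)
{
    if (argc < 2) { std::cerr << "usage: " << argv[0] << " file.ptx\n"; return 1; }

    const std::map<std::string, int> flops_per_op = {
        {"add.f32", 1}, {"sub.f32", 1}, {"mul.f32", 1},
        {"fma.rn.f32", 2},   // fused multiply-add: two FLOPs
        // add.s32, mad.lo.s32, ... are integer ops: deliberately absent
    };

    std::ifstream ptx(argv[1]);
    long long total = 0;
    std::string line;
    while (std::getline(ptx, line))
        for (const auto &op : flops_per_op)
            if (line.find(op.first) != std::string::npos)
                total += op.second;

    std::cout << "static FP instruction count: " << total << "\n";
    return 0;
}
```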

CUDA/PTX 32-bit vs. 64-bit

扶醉桌前 submitted on 2019-12-22 05:24:09
Question: CUDA compilers have options for producing 32-bit or 64-bit PTX. What is the difference between these? Is it like x86, where NVIDIA GPUs actually have 32-bit and 64-bit ISAs? Or is it related to host code only? Answer 1: Pointers are certainly the most obvious difference. The 64-bit machine model enables 64-bit pointers. 64-bit pointers enable a variety of things, such as address spaces larger than 4 GB and unified virtual addressing. Unified virtual addressing in turn enables other things, such as
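A quick, illustrative way to observe the machine model in effect (the file name and build line are hypothetical, and recent CUDA toolchains no longer support -m32 for device code):

```cuda
// Build (illustrative):
//   nvcc -m64 ptrsize.cu -o ptrsize    # 64-bit PTX: ".address_size 64"
#include <cstdio>

__global__ void ptr_size()
{
    // Under the 64-bit machine model, generic pointers are 8 bytes wide.
    printf("device pointer size: %u bytes\n", (unsigned)sizeof(void *));
}

int main()
{
    ptr_size<<<1, 1>>>();
    cudaDeviceSynchronize();
    printf("host pointer size: %zu bytes\n", sizeof(void *));
    return 0;
}
```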

How to generate, compile and run CUDA kernels at runtime

百般思念 submitted on 2019-12-21 14:19:32
Question: Well, I have quite a delicate question :) Let's start with what I have: Data, a large array of data, copied to the GPU; Program, generated by the CPU (host), which needs to be evaluated for every datum in that array. The program changes very frequently, can be generated as a CUDA string, PTX string or something else (?), and needs to be re-evaluated after each change. What I want: basically, I just want to make this as effective (fast) as possible, e.g. avoid compilation of CUDA to PTX. The solution can be even
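A minimal sketch of the usual answer to this, assuming NVRTC (runtime compilation of CUDA C++ to PTX) plus the driver API for loading and launching; error checking and the actual launch are elided, and the kernel string is illustrative:

```cuda
// Build (illustrative): nvcc runtime_compile.cu -lnvrtc -lcuda
#include <cuda.h>
#include <nvrtc.h>

const char *src = R"(
extern "C" __global__ void scale(float *d, float f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
})";

int main()
{
    // Compile the CUDA source string to PTX at run time.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "scale.cu", 0, nullptr, nullptr);
    nvrtcCompileProgram(prog, 0, nullptr);   // redo this on each change

    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    char *ptx = new char[ptx_size];
    nvrtcGetPTX(prog, ptx);
    nvrtcDestroyProgram(&prog);

    // Load the PTX and fetch the kernel via the driver API.
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;   cuModuleLoadDataEx(&mod, ptx, 0, nullptr, nullptr);
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "scale");
    // ... allocate data, build a void* args[] array, then:
    // cuLaunchKernel(fn, blocks,1,1, threads,1,1, 0, 0, args, nullptr);
    return 0;
}
```

Caching the generated PTX (or the loaded module) per program version avoids paying the compile and load cost on every evaluation.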

Load function parameters in inline PTX

廉价感情. submitted on 2019-12-18 07:10:43
Question: I have the following function with inline assembly that works fine in debug mode in 32-bit Visual Studio 2008: __device__ void add(int* pa, int* pb) { asm(".reg .u32 s<3>;"::); asm(".reg .u32 r<14>;"::); asm("ld.global.b32 s0, [%0];"::"r"(&pa)); //load addresses of pa, pb printf(...); asm("ld.global.b32 s1, [%0];"::"r"(&pb)); printf(...); asm("ld.global.b32 r1, [s0+8];"::); printf(...); asm("ld.global.b32 r2, [s1+8];"::); printf(...); ...// perform some operations } pa and pb are globally
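For contrast, a sketch of the constraint-based style that NVIDIA's inline-PTX documentation recommends (an illustrative rewrite assuming a 64-bit build, not the asker's code patched in place): letting the compiler bind the operands avoids hand-named registers and any reliance on debug-mode spill behavior.

```cuda
// "=r" binds a 32-bit output register; "l" binds a 64-bit (pointer)
// input, so the loads stay valid across optimization levels.
__device__ int add_third_elements(const int *pa, const int *pb)
{
    int a, b;
    asm volatile("ld.global.s32 %0, [%1+8];" : "=r"(a) : "l"(pa)); // pa[2]
    asm volatile("ld.global.s32 %0, [%1+8];" : "=r"(b) : "l"(pb)); // pb[2]
    return a + b;
}

__global__ void demo(const int *pa, const int *pb, int *out)
{
    *out = add_third_elements(pa, pb);
}
```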

CUDA: disable L1 cache only for one variable

久未见 submitted on 2019-12-17 17:46:21
Question: Is there any way on CUDA 2.0 devices to disable the L1 cache only for one specific variable? I know that one can disable the L1 cache at compile time by adding the flag -Xptxas -dlcm=cg to nvcc for all memory operations. However, I want to disable the cache only for memory reads of a specific global variable, so that all other memory reads go through the L1 cache. Based on a search I have done on the web, a possible solution is through PTX assembly code. Answer 1: As mentioned above you can use
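A sketch of the PTX route (the wrapper name and int-only signature are illustrative): a load emitted with the .cg cache operator is cached in L2 only, while surrounding loads keep the default (L1-cached, .ca) behavior.

```cuda
// Load that bypasses L1 (cache-global, .cg) for one pointer only.
__device__ __forceinline__ int load_bypass_l1(const int *ptr)
{
    int value;
    asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(value) : "l"(ptr));
    return value;
}

__global__ void kernel(const int *uncached, const int *cached, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = load_bypass_l1(uncached + i)   // L2-only load
               + cached[i];                     // normal (L1-cached) load
}
```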

CUDA: How to use -arch and -code and SM vs COMPUTE

守給你的承諾、 submitted on 2019-12-17 07:24:40
Question: I am still not sure how to properly specify the architectures for code generation when building with nvcc. I am aware that there is machine code as well as PTX code embedded in my binary, and that this can be controlled via the compiler switches -code and -arch (or a combination of both using -gencode). Now, according to this, apart from the two compiler flags there are also two ways of specifying architectures: sm_XX and compute_XX, where compute_XX refers to a virtual and sm_XX to a real
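For illustration, a hypothetical build line and a tiny probe kernel (the names are made up): the -gencode pairs below embed SASS (machine code) for sm_70 plus PTX for compute_70, so newer GPUs can still JIT-compile the PTX, and __CUDA_ARCH__ reports the virtual architecture the device code was compiled against.

```cuda
// Build (illustrative):
//   nvcc -gencode arch=compute_70,code=sm_70 \
//        -gencode arch=compute_70,code=compute_70 app.cu -o app
#include <cstdio>

__global__ void arch_probe()
{
#ifdef __CUDA_ARCH__
    // __CUDA_ARCH__ reflects the *virtual* architecture (compute_XX)
    // this device code was compiled for, e.g. 700 for compute_70.
    if (threadIdx.x == 0)
        printf("compiled for virtual arch %d\n", __CUDA_ARCH__);
#endif
}

int main()
{
    arch_probe<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```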