How can I call a ptx function from CUDA C?

前端未结

关注

 2  1025

I am trying to find a way to call a ptx function (.func) from CUDA C. Say I had a ptx function like this:

.func (.reg .s32 %res) inc_ptr ( .reg .s32 %ptr, .reg .


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  小鲜肉        
                
              
                            
                2021-01-21 10:14
              
            
            
                                                                       
As far as i know, CUDA C supports asm, there is a doc which locate in doc directory after cuda tool kit installed.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  眼角桃花        
                
              
                            
                2021-01-21 10:27
              
            
            
                                                                       
This can be done using the separate compilation facilities introduced with CUDA 5.0. I don't believe there is a way to do this in "whole" program compilation mode or in toolkit versions prior to CUDA 5.0 or in PTX revisions prior to 3.1.

It is probably easiest to illustrate how to do this with a worked example. Let's start with a simple PTX function for incrementing pointers, similar to your example:

.version 3.1
.target sm_30
.address_size 32
.visible .func inc_ptr(.param .b32 ptr, .param .b32 inc)
{
    .reg .s32   %r<6>;
    ld.param.u32 %r1, [ptr];
    ld.param.u32 %r2, [inc];
    ld.u32 %r3, [%r1];
    ld.u32 %r4, [%r3];
    add.s32 %r5, %r4, %r2;
    st.u32  [%r3], %r5;
    ret;
}


This can be compiled to a relocatable device object using ptxas and then packed into a fatbinary container file. The latter step seems to be critical. The default ptxas output is only a relocatable elf object, there is no fatbinary container produced. It seems that the device code linkage phase that nvcc runs (at least in CUDA 5) is expecting all device code is present in fatbinary containers. The linkage will fail otherwise. The result looks like this:

$ ptxas -arch=sm_30 -c -o inc_ptr.gpu.o inc_ptr.ptx
$ fatbinary -arch=sm_30 -create inc_ptr.fatbin -elf inc_ptr.gpu.o 
$ cuobjdump -sass inc_ptr.fatbin 

Fatbin elf code:
================
arch = sm_30
code version = [1,6]
producer = <unknown>
host = mac
compile_size = 32bit

    code for sm_30
        Function : inc_ptr
    /*0008*/     /*0x0040dc8580000000*/     LD R3, [R4];
    /*0010*/     /*0x00301c8580000000*/     LD R0, [R3];
    /*0018*/     /*0x14001c0348000000*/     IADD R0, R0, R5;
    /*0020*/     /*0x00301c8590000000*/     ST [R3], R0;
    /*0028*/     /*0x00001de790000000*/     RET;
    /*0030*/     /*0x00001de440000000*/     NOP CC.T;
    /*0038*/     /*0x00001de440000000*/     NOP CC.T;
    /*0040*/     /*0xe0001de74003ffff*/     BRA 0x40;
    /*0048*/     /*0x00001de440000000*/     NOP CC.T;
    /*0050*/     /*0x00001de440000000*/     NOP CC.T;
    /*0058*/     /*0x00001de440000000*/     NOP CC.T;
    /*0060*/     /*0x00001de440000000*/     NOP CC.T;
    /*0068*/     /*0x00001de440000000*/     NOP CC.T;
    /*0070*/     /*0x00001de440000000*/     NOP CC.T;
    /*0078*/     /*0x00001de440000000*/     NOP CC.T;
        ........................


You can see that the fatbinary contains the microcode from the assembled PTX. With the device function fatbin prepared, you can do something like this in CUDA C code:

extern "C" __device__ void inc_ptr(int* &ptr, const int inc);

__global__
void memsetkernel(int *inout, const int val, const int N)
{
    int stride = blockDim.x * gridDim.x;
    int *p = inout;
    inc_ptr(p, threadIdx.x + blockDim.x*blockIdx.x);

    for(; p < inout+N; inc_ptr(p, stride)) *p = val;
}  


int main(void)
{
    const int n=10;
    int *p;
    cudaMalloc((void**)&p, sizeof(int)*size_t(n));
    memsetkernel<<<1,32>>>(p, 5, n);

    return 0;
}


In separate compilation mode, the device code toolchain will respect the extern declaration and (as long as you get symbol mangling under control), the device function fatbinary can be linked with other device and host code to produce a final object:

$ nvcc -arch=sm_30 -Xptxas="-v" -dlink -o memset.out inc_ptr.fatbin memset_kernel.cu 

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z12memsetkernelPiii' for 'sm_30'
ptxas info    : Function properties for _Z12memsetkernelPiii
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 20 registers, 332 bytes cmem[0]

$ cuobjdump -sass memset.out 

Fatbin elf code:
================
arch = sm_30
code version = [1,6]
producer = <unknown>
host = mac
compile_size = 32bit
identifier = inc_ptr.fatbin memset_kernel.cu 

    code for sm_30
        Function : _Z12memsetkernelPiii
    /*0008*/     /*0x10005de428004001*/     MOV R1, c [0x0] [0x44];
    /*0010*/     /*0x20105d034800c000*/     IADD R1, R1, -0x8;
    /*0018*/     /*0x00019de428004005*/     MOV R6, c [0x0] [0x140];
    /*0020*/     /*0x10101c034800c000*/     IADD R0, R1, 0x4;
    /*0028*/     /*0x8400dc042c000000*/     S2R R3, SR_Tid_X;
    /*0030*/     /*0x90041c0348004000*/     IADD R16, R0, c [0x0] [0x24];
    /*0038*/     /*0x94001c042c000000*/     S2R R0, SR_CTAid_X;
    /*0048*/     /*0xd0009de428004000*/     MOV R2, c [0x0] [0x34];
    /*0050*/     /*0x91045d0348004000*/     IADD R17, R16, -c [0x0] [0x24];
    /*0058*/     /*0x40011de428000000*/     MOV R4, R16;
    /*0060*/     /*0xa0015ca320064000*/     IMAD R5, R0, c [0x0] [0x28], R3;
    /*0068*/     /*0x01119c85c8000000*/     STL [R17], R6;
    /*0070*/     /*0xa0209ca350004000*/     IMUL R2, R2, c [0x0] [0x28];
    /*0078*/     /*0x0001000710000000*/     JCAL 0x0;
    /*0088*/     /*0x0110dc85c0000000*/     LDL R3, [R17];
    /*0090*/     /*0x20001de428004005*/     MOV R0, c [0x0] [0x148];
    /*0098*/     /*0x00049c4340004005*/     ISCADD R18, R0, c [0x0] [0x140], 0x2;
    /*00a0*/     /*0x4831dc031b0e0000*/     ISETP.GE.U32.AND P0, pt, R3, R18, pt;
    /*00a8*/     /*0x000001e780000000*/     @P0 EXIT;
    /*00b0*/     /*0x1004dde428004005*/     MOV R19, c [0x0] [0x144];
    /*00b8*/     /*0x0034dc8590000000*/     ST [R3], R19;
    /*00c8*/     /*0x40011de428000000*/     MOV R4, R16;
    /*00d0*/     /*0x08015de428000000*/     MOV R5, R2;
    /*00d8*/     /*0x0001000710000000*/     JCAL 0x0;
    /*00e0*/     /*0x0110dc85c0000000*/     LDL R3, [R17];
    /*00e8*/     /*0x4831dc03188e0000*/     ISETP.LT.U32.AND P0, pt, R3, R18, pt;
    /*00f0*/     /*0x000001e74003ffff*/     @P0 BRA 0xb8;
    /*00f8*/     /*0x00001de780000000*/     EXIT;
    /*0100*/     /*0xe0001de74003ffff*/     BRA 0x100;
    /*0108*/     /*0x00001de440000000*/     NOP CC.T;
    /*0110*/     /*0x00001de440000000*/     NOP CC.T;
    /*0118*/     /*0x00001de440000000*/     NOP CC.T;
    /*0120*/     /*0x00001de440000000*/     NOP CC.T;
    /*0128*/     /*0x00001de440000000*/     NOP CC.T;
    /*0130*/     /*0x00001de440000000*/     NOP CC.T;
    /*0138*/     /*0x00001de440000000*/     NOP CC.T;
        .....................................


        Function : inc_ptr
    /*0008*/     /*0x0040dc8580000000*/     LD R3, [R4];
    /*0010*/     /*0x00301c8580000000*/     LD R0, [R3];
    /*0018*/     /*0x14001c0348000000*/     IADD R0, R0, R5;
    /*0020*/     /*0x00301c8590000000*/     ST [R3], R0;
    /*0028*/     /*0x00001de790000000*/     RET;
    /*0030*/     /*0x00001de440000000*/     NOP CC.T;
    /*0038*/     /*0x00001de440000000*/     NOP CC.T;
    /*0040*/     /*0xe0001de74003ffff*/     BRA 0x40;
    /*0048*/     /*0x00001de440000000*/     NOP CC.T;
    /*0050*/     /*0x00001de440000000*/     NOP CC.T;
    /*0058*/     /*0x00001de440000000*/     NOP CC.T;
    /*0060*/     /*0x00001de440000000*/     NOP CC.T;
    /*0068*/     /*0x00001de440000000*/     NOP CC.T;
    /*0070*/     /*0x00001de440000000*/     NOP CC.T;
    /*0078*/     /*0x00001de440000000*/     NOP CC.T;
        ........................


There might be other tricks which can be played with the toolchain to achieve this, but this approach certainly works.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复