How to create or manipulate GPU assembler?

难免孤独 2020-12-13 07:30

Does anyone have experience creating or manipulating GPU machine code, possibly at run time?

I am interested in modifying GPU assembler code, possibly at run time.

6 Answers
  • 2020-12-13 07:42

    These links might be interesting for you, although they are easy to find, so you have probably already seen them:

    http://www.worldlingo.com/ma/enwiki/en/ARB_(GPU_assembly_language)

    http://developer.nvidia.com/object/gpu_programming_guide.html

    http://developer.amd.com/gpu/Pages/default.aspx

    http://msdn.microsoft.com/en-us/library/bb219840.aspx

    http://www.khronos.org/opencl/

    http://www.comp.nus.edu.sg/~ashwinna/docs/CS6282_Modeling_the_GPU.pdf

  • 2020-12-13 07:43

    OpenCL is designed for that purpose: you provide a program as a string and can compile it at runtime. See the links provided by the other poster.

  • 2020-12-13 07:43

    An assembler for the NVIDIA Fermi ISA: http://code.google.com/p/asfermi

  • 2020-12-13 07:44

    In the CUDA driver API, the module management functions allow an application to load a "module" at runtime, which is (roughly) a PTX or cubin file. PTX is the intermediate language, while a cubin is an already-compiled set of instructions. cuModuleLoadData() and cuModuleLoadDataEx() appear to be capable of loading the module from a pointer in RAM, which means that no actual file is required.
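A minimal sketch of that load-from-RAM path, via Python's ctypes against libcuda (my own illustration; the answer describes the C driver API, and cuInit, cuDeviceGet, cuCtxCreate, and cuModuleLoadData are real driver entry points, but the error handling here is deliberately simplified). On a machine without an NVIDIA driver the function simply returns None:

```python
import ctypes

def load_ptx_from_ram(ptx: bytes):
    """Load a PTX image straight from memory with the CUDA driver API.

    Returns a module handle, or None if no NVIDIA driver (libcuda)
    is available or any driver call fails.
    """
    try:
        cuda = ctypes.CDLL("libcuda.so")
    except OSError:
        return None  # no NVIDIA driver on this machine

    if cuda.cuInit(0) != 0:  # must precede all other driver API calls
        return None

    device = ctypes.c_int()
    if cuda.cuDeviceGet(ctypes.byref(device), 0) != 0:
        return None

    context = ctypes.c_void_p()
    if cuda.cuCtxCreate(ctypes.byref(context), 0, device) != 0:
        return None

    module = ctypes.c_void_p()
    # cuModuleLoadData reads the NUL-terminated PTX image from RAM;
    # no file on disk is involved.
    if cuda.cuModuleLoadData(ctypes.byref(module), ptx + b"\0") != 0:
        return None
    return module.value
```

The same shape applies in C; ctypes is used here only to keep the sketch self-contained.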

    So your problem seems to be: how to programmatically build a cubin module in RAM? As far as I know, NVIDIA has never released details of the instructions actually understood by its hardware. There is, however, an independent open-source package called decuda which includes "cudasm", an assembler for what "older" NVIDIA GPUs understand ("older" = GeForce 8xxx and 9xxx). I do not know how easy it would be to integrate into a wider application; it is written in Python.

    Newer NVIDIA GPUs use a distinct instruction set (how distinct, I do not know), so a cubin for an old GPU ("compute capability 1.x" in NVIDIA/CUDA terminology) may not work on a recent GPU (compute capability 2.x, i.e. the "Fermi architecture", such as a GTX 480). This is why PTX is usually preferred: a given PTX file will be portable across GPU generations.
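As a side note, which architecture a PTX module was generated for is visible right in its text: the .target directive near the top names the compute capability (sm_20 = Fermi, i.e. compute capability 2.0). A trivial helper (my own illustration, not part of any toolkit) can pull it out:

```python
from typing import Optional

def ptx_target(ptx: str) -> Optional[str]:
    """Return the compute-capability target named in a PTX module,
    e.g. 'sm_20', or None if no .target directive is found."""
    for line in ptx.splitlines():
        line = line.strip()
        if line.startswith(".target"):
            # .target may list several targets; take the first one.
            return line.split()[1].rstrip(",")
    return None

print(ptx_target(".version 5.0\n.target sm_20\n.address_size 64"))  # prints sm_20
```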

  • 2020-12-13 07:58

    I've found the open-source (BSD-licensed) gpuocelot project interesting.

    It's "a dynamic compilation framework for PTX". I would call it a CPU translator.

    "Ocelot currently allows CUDA programs to be executed on NVIDIA GPUs, AMD GPUs, and x86-CPUs". As far as I know, this framework does control-flow and data-flow analysis on the PTX kernel in order to apply the proper transformations.

  • 2020-12-13 08:02

    NVIDIA PTX generation and modification

    Not sure how low-level it is compared to the hardware (likely undocumented?), but it can be generated from C/C++-like GPU languages, modified, and reused in a few ways:

    • OpenCL clGetProgramInfo(program, CL_PROGRAM_BINARIES, ...) + clCreateProgramWithBinary: minimal runnable example: How to use clCreateProgramWithBinary in OpenCL?

      These are standardized OpenCL APIs, which produce and consume implementation-defined formats; in driver version 375.39 for Linux, that format happens to be human-readable PTX.

      So you can dump the PTX, modify it, and reload.

    • nvcc: can compile CUDA GPU-side code to PTX assembly simply with:

      nvcc --ptx a.cu
      

      nvcc can also compile OpenCL C programs containing both device and host code: Compile and build .cl file using NVIDIA's nvcc Compiler? But I could not find how to get the PTX out with nvcc, which kind of makes sense, since OpenCL is just plain C plus C strings, and not a magic C superset. This is also suggested by: https://arrayfire.com/generating-ptx-files-from-opencl-code/

      And I'm not sure how to recompile the modified PTX and use it as I did with clCreateProgramWithBinary: How to compile PTX code

    Using clGetProgramInfo, an input CL kernel:

    __kernel void kmain(__global int *out) {
        out[get_global_id(0)]++;
    }
    

    gets compiled to some PTX like:

    //
    // Generated by NVIDIA NVVM Compiler
    //
    // Compiler Build ID: CL-21124049
    // Cuda compilation tools, release 8.0, V8.0.44
    // Based on LLVM 3.4svn
    //
    
    .version 5.0
    .target sm_20
    .address_size 64
    
        // .globl   _Z3incPi
    
    .visible .entry _Z3incPi(
        .param .u64 _Z3incPi_param_0
    )
    {
        .reg .pred  %p<2>;
        .reg .b32   %r<4>;
        .reg .b64   %rd<5>;
    
    
        ld.param.u64    %rd1, [_Z3incPi_param_0];
        mov.u32     %r1, %ctaid.x;
        setp.gt.s32 %p1, %r1, 2;
        @%p1 bra    BB0_2;
    
        cvta.to.global.u64  %rd2, %rd1;
        mul.wide.s32    %rd3, %r1, 4;
        add.s64     %rd4, %rd2, %rd3;
        ldu.global.u32  %r2, [%rd4];
        add.s32     %r3, %r2, 1;
        st.global.u32   [%rd4], %r3;
    
    BB0_2:
        ret;
    }
    

    Then if for example you modify the line:

    add.s32     %r3, %r2, 1;
    

    to:

    add.s32     %r3, %r2, 2;
    

    and reuse the modified PTX, it actually increments by 2 instead of 1, as expected.
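That hand-edit can also be scripted: since the dumped PTX is plain text, ordinary string processing is enough before handing the image back to the driver. A small sketch (the instruction pattern is taken from the listing above; the function name is my own):

```python
import re

def patch_increment(ptx: str, new_value: int) -> str:
    """Rewrite the immediate operand of the add.s32 from the listing,
    turning 'add.s32 %r3, %r2, 1;' into 'add.s32 %r3, %r2, <new_value>;'."""
    return re.sub(r"(add\.s32\s+%r3,\s+%r2,\s+)\d+;",
                  rf"\g<1>{new_value};", ptx)

line = "add.s32     %r3, %r2, 1;"
print(patch_increment(line, 2))  # prints add.s32     %r3, %r2, 2;
```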
