How to generate, compile and run CUDA kernels at runtime

独厮守ぢ 2021-02-15 23:51

Well, I have quite a delicate question :)

Let's start with what I have:

  1. Data, large array of data, copied to GPU
  2. Progra
1 Answer
  • 2021-02-16 00:23

    In his comment, Roger Dahl has linked the following post

    Passing the PTX program to the CUDA driver directly

    in which the use of two functions, namely cuModuleLoad and cuModuleLoadDataEx, is addressed. The former loads PTX code from a file and passes it to the CUDA driver, which compiles it for the current device. The latter avoids file I/O and lets you pass the PTX code to the driver as a C string. In either case, you must already have the PTX code at your disposal, either as the result of compiling a CUDA kernel (to be loaded from file or copied and pasted into the C string) or as a hand-written source.
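
To illustrate the difference between the two loading paths, here is a minimal sketch, not taken from the linked post, that reads a PTX file into memory so that the text can be handed to cuModuleLoadDataEx as a C string. The helper name read_text_file and the file name kernel.ptx are made up for illustration, and the driver call appears only in a comment because it needs an initialized CUDA context.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Read a whole text file (for example PTX emitted by `nvcc -ptx`) into a string.
// Hypothetical helper, not part of the CUDA API.
std::string read_text_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// With a current CUDA context, the in-memory text could then be loaded through
// the driver API, which accepts a NUL-terminated PTX image:
//   CUmodule module;
//   std::string ptx = read_text_file("kernel.ptx");  // hypothetical file name
//   CUresult r = cuModuleLoadDataEx(&module, ptx.c_str(), 0, nullptr, nullptr);
```

Calling cuModuleLoad with the file name directly achieves the same result in one step; going through a string mainly helps when the PTX never needs to touch the disk.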

    But what happens if you have to create the PTX code on-the-fly starting from a CUDA kernel? Following the approach in CUDA Expression templates, you can define a string containing your CUDA kernel like

    // ss, def_line and eval_line are assumed to be std::stringstream objects (requires <sstream>):
    // def_line holds the kernel's parameter list, eval_line the expression evaluated per element.
    ss << "extern \"C\" __global__ void kernel( ";
    ss << def_line.str() << ", unsigned int vector_size, unsigned int number_of_used_threads ) { \n";
    ss << "\tint idx = blockDim.x * blockIdx.x + threadIdx.x; \n";
    ss << "\tfor(unsigned int i = 0; i < ";
    ss << "(vector_size + number_of_used_threads - 1) / number_of_used_threads; ++i) {\n";
    ss << "\t\tif(idx < vector_size) { \n";
    ss << "\t\t\t" << eval_line.str() << "\n";
    ss << "\t\t\tidx += number_of_used_threads;\n";
    ss << "\t\t}\n";
    ss << "\t}\n";
    ss << "}\n\n\n\n";
    

    then write it to a file (kernel_filename below) and compile it to PTX with a system call to nvcc, as

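    // NVCC and NVCC_FLAGS are presumably string macros (compiler path and flags)
    // defined elsewhere; requires <cstdlib>, <iostream> and <string>.
    int nvcc_exit_status = system(
        (std::string(NVCC) + " -ptx " + NVCC_FLAGS + " " + kernel_filename
            + " -o " + kernel_comp_filename).c_str()
    );

    if (nvcc_exit_status) {
        std::cerr << "ERROR: nvcc exits with status code: " << nvcc_exit_status << std::endl;
        exit(1);
    }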
    

    and finally use cuModuleLoad to load the PTX code from file into the CUDA driver and cuModuleGetFunction to retrieve a handle to the kernel, like

    result = cuModuleLoad(&cuModule, kernel_comp_filename.c_str());
    assert(result == CUDA_SUCCESS);
    result = cuModuleGetFunction(&cuFunction, cuModule, "kernel");
    assert(result == CUDA_SUCCESS);
    

    Of course, expression templates have nothing to do with this problem and I'm only quoting the source of the ideas I'm reporting in this answer.
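
Putting the host-side steps together, a minimal sketch could look like the following. The helper names build_kernel_source, write_text_file and make_nvcc_command are hypothetical, params and body stand in for def_line.str() and eval_line.str(), and the loop is written as a plain strided loop that should visit the same indices as the original. Nothing here is taken verbatim from the quoted source.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Build the CUDA source of the kernel as a string.
// `params` plays the role of def_line.str(), `body` that of eval_line.str().
std::string build_kernel_source(const std::string& params, const std::string& body) {
    std::ostringstream ss;
    ss << "extern \"C\" __global__ void kernel(" << params
       << ", unsigned int vector_size, unsigned int number_of_used_threads) {\n"
       << "\tunsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;\n"
       << "\tfor (; idx < vector_size; idx += number_of_used_threads) {\n"
       << "\t\t" << body << "\n"
       << "\t}\n"
       << "}\n";
    return ss.str();
}

// Write the generated source to disk so that nvcc can read it.
void write_text_file(const std::string& path, const std::string& text) {
    std::ofstream out(path, std::ios::binary);
    out << text;
    out.flush();
    if (!out) {
        throw std::runtime_error("cannot write " + path);
    }
}

// Assemble the command line that asks nvcc to emit PTX.
// Paths are not shell-quoted here, so do not feed it untrusted input.
std::string make_nvcc_command(const std::string& nvcc, const std::string& flags,
                              const std::string& source_path,
                              const std::string& ptx_path) {
    return nvcc + " -ptx " + flags + " " + source_path + " -o " + ptx_path;
}

// Possible usage (file names and the -arch value are example assumptions;
// std::system needs <cstdlib>):
//   write_text_file("kernel.cu",
//                   build_kernel_source("float* a, const float* b", "a[idx] += b[idx];"));
//   int status = std::system(
//       make_nvcc_command("nvcc", "-arch=sm_70", "kernel.cu", "kernel.ptx").c_str());
//   // a non-zero status means the compilation failed
```

Keeping the command assembly in its own function makes it easy to log the exact nvcc invocation when a generated kernel fails to compile.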
