How to generate, compile and run CUDA kernels at runtime


独厮守ぢ 2021-02-15 23:51

Well, I have quite a delicate question :)

Let's start with what I have:

Data, large array of data, copied to GPU
Progra

1 answer

攒了一身酷 (original poster)

2021-02-16 00:23

In his comment, Roger Dahl has linked the following post

Passing the PTX program to the CUDA driver directly

which discusses two functions, cuModuleLoad and cuModuleLoadDataEx. The former loads PTX code from a file and hands it to the CUDA driver, which JIT-compiles it for the current device (nvcc is not involved at this stage). The latter avoids file I/O and lets you pass the PTX code to the driver as a C string. In either case, you must already have the PTX code at your disposal, either as the result of compiling a CUDA kernel (to be loaded from file or copied and pasted into the C string) or as hand-written source.

But what if you have to create the PTX code on the fly, starting from a CUDA kernel? Following the approach in CUDA Expression templates, you can build a string containing your CUDA kernel source, like

    ss << "extern \"C\" __global__ void kernel( ";
    ss << def_line.str() << ", unsigned int vector_size, unsigned int number_of_used_threads ) { \n";
    ss << "\tint idx = blockDim.x * blockIdx.x + threadIdx.x; \n";
    ss << "\tfor(unsigned int i = 0; i < ";
    ss << "(vector_size + number_of_used_threads - 1) / number_of_used_threads; ++i) {\n";
    ss << "\t\tif(idx < vector_size) { \n";
    ss << "\t\t\t" << eval_line.str() << "\n";
    ss << "\t\t\tidx += number_of_used_threads;\n";
    ss << "\t\t}\n";
    ss << "\t}\n";
    ss << "}\n\n\n\n";
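The snippet above depends on def_line and eval_line, which are built elsewhere in the original code. As a minimal self-contained sketch, assuming a hypothetical element-wise addition kernel (the function names, parameter list and body below are illustrative and not taken from the source), the kernel text could be assembled like this:

```cpp
#include <sstream>
#include <string>

// Builds the source text of a grid-stride kernel named "kernel".
// `params` stands in for def_line.str() (the generated parameter list) and
// `body` for eval_line.str() (the per-element statement).
std::string make_kernel_source(const std::string& params, const std::string& body) {
    std::ostringstream ss;
    ss << "extern \"C\" __global__ void kernel( " << params
       << ", unsigned int vector_size, unsigned int number_of_used_threads ) {\n";
    ss << "\tint idx = blockDim.x * blockIdx.x + threadIdx.x;\n";
    ss << "\tfor(unsigned int i = 0; i < "
       << "(vector_size + number_of_used_threads - 1) / number_of_used_threads; ++i) {\n";
    ss << "\t\tif(idx < vector_size) {\n";
    ss << "\t\t\t" << body << "\n";
    ss << "\t\t\tidx += number_of_used_threads;\n";
    ss << "\t\t}\n";
    ss << "\t}\n";
    ss << "}\n";
    return ss.str();
}

// Hypothetical example (an assumption for illustration): c[idx] = a[idx] + b[idx].
std::string make_add_kernel_source() {
    return make_kernel_source("const float* a, const float* b, float* c",
                              "c[idx] = a[idx] + b[idx];");
}
```

The extern "C" in the generated text matters: it keeps the kernel's name unmangled, so the later lookup by the plain string "kernel" can succeed.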

then write it to a file and invoke nvcc through a system call to compile it to PTX:

    int nvcc_exit_status = system(
        (std::string(NVCC) + " -ptx " + NVCC_FLAGS + " " + kernel_filename + " -o " + kernel_comp_filename).c_str()
    );
    if (nvcc_exit_status) {
        std::cerr << "ERROR: nvcc exits with status code: " << nvcc_exit_status << std::endl;
        exit(1);
    }
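The compile step above assumes the kernel source has already been written to kernel_filename and that NVCC and NVCC_FLAGS are defined elsewhere. A minimal sketch of both pieces follows, with the compiler path and flags passed as ordinary arguments; the helper names are hypothetical and only mirror the original call:

```cpp
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Writes the generated kernel source to `path`; returns false on I/O failure.
bool write_kernel_file(const std::string& path, const std::string& source) {
    std::ofstream out(path);
    out << source;
    out.close();
    return !out.fail();
}

// Builds the command line that asks nvcc to emit PTX (-ptx) for `cu_file`.
std::string make_nvcc_command(const std::string& nvcc, const std::string& flags,
                              const std::string& cu_file, const std::string& ptx_file) {
    return nvcc + " -ptx " + flags + " " + cu_file + " -o " + ptx_file;
}

// Runs nvcc through the shell; returns true when it exits with status 0.
bool compile_to_ptx(const std::string& nvcc, const std::string& flags,
                    const std::string& cu_file, const std::string& ptx_file) {
    const int status = std::system(make_nvcc_command(nvcc, flags, cu_file, ptx_file).c_str());
    if (status != 0) {
        std::cerr << "ERROR: nvcc exits with status code: " << status << std::endl;
        return false;
    }
    return true;
}
```

For example, make_nvcc_command("nvcc", "-arch=sm_70", "kernel.cu", "kernel.ptx") yields "nvcc -ptx -arch=sm_70 kernel.cu -o kernel.ptx". The sketch does not quote its arguments, so paths containing spaces would need extra handling.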

and finally use cuModuleLoad to load the PTX code from file into the CUDA driver, followed by cuModuleGetFunction to obtain a handle to the kernel, like

    result = cuModuleLoad(&cuModule, kernel_comp_filename.c_str());
    assert(result == CUDA_SUCCESS);
    result = cuModuleGetFunction(&cuFunction, cuModule, "kernel");
    assert(result == CUDA_SUCCESS);

Of course, expression templates have nothing to do with this problem and I'm only quoting the source of the ideas I'm reporting in this answer.
