Question
I am trying to parallelize a function that contains several procedures. The function looks like this:
void _myfunction(M1, M2) {
    for (a = 0; a < A; a++) {
        Amatrix = procedure1(M1); /* contains for loops */
        Bmatrix = procedure2(M1); /* contains for loops */
        ...
        for (z = 1; z < Z; z++) {
            calculations with Amatrix(z) to obtain AAmatrix
            calculations with Bmatrix(z) to obtain BBmatrix
            for (e = 1; e < E; e++) {
                calculations with AAmatrix(e) to obtain CCmatrix
                calculations with BBmatrix(e) to obtain DDmatrix
            }
        }
        for (q = 0; q < Q; q++) { calculations with CCmatrix(q) }
        for (m = 0; m < M; m++) { calculations with DDmatrix(m) }
    }
}
Concerning the functions procedure1() and procedure2(), I have ported them to CUDA and everything is going fine (each of these procedures has its own for loops). The reason these procedures are separate is that they are conceptually independent algorithms, as opposed to the rest of the code, which deals with the more general logic.
Now I am trying to port the rest of the code to CUDA, but I am not sure what to do. Of course, I want to keep the same structure of the entire function, if possible. My first thought was to turn the function _myfunction(arg1, arg2, ...) itself into a kernel, but my problem is that there are already two kernels executed in order inside it. Somewhere I have read that we can use streams, but again I am not sure how to do that or whether it is the right approach.
Question: Can somebody give a hint on how to port a program to CUDA?
P.S.: I am using a GeForce 9600GT (Compute Capability 1.1) and CUDA Toolkit 5.0.
Answer 1:
Keeping the same structure might not be achievable in CUDA, because the problem as written might not be parallelizable that way. That is basically due to the nature of the problem. Concretely: on your device you cannot launch a kernel from within another kernel. That mechanism is called Dynamic Parallelism, and it is very recent; Compute Capability 1.1 does not support it. To my knowledge, Dynamic Parallelism was introduced with the Kepler architecture (Compute Capability 3.5). You would have to do a bit of research to check which devices support it, if you are interested. Summing up: you won't be able to keep the exact same structure, but that doesn't mean you cannot port the code at all.
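For illustration only, here is a minimal sketch of what Dynamic Parallelism looks like on hardware that supports it (Compute Capability 3.5+, compiled with relocatable device code). The kernel names are made up for the example; this will not compile for a CC 1.1 device like yours:

// Minimal Dynamic Parallelism sketch (kernel names are illustrative).
// Requires Compute Capability 3.5+ and compilation with:
//   nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt
#include <cstdio>

__global__ void child_kernel(int parent)
{
    printf("child thread %d launched by parent thread %d\n",
           threadIdx.x, parent);
}

__global__ void parent_kernel()
{
    // Legal on CC 3.5+ only; on CC 1.1 a kernel launch inside a kernel
    // is a compile-time error.
    child_kernel<<<1, 4>>>(threadIdx.x);
}

int main()
{
    parent_kernel<<<1, 2>>>();
    cudaDeviceSynchronize(); // the parent grid completes only after all children do
    return 0;
}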
Here are my recommendations for porting your program (or any other):
- Read the CUDA C Programming Guide and the CUDA C Best Practices Guide (assuming you use CUDA C).
- Restructure/rethink the original problem and see whether it can be parallelized.
- Perform a static analysis of your code (basically, read through it and, using your programming knowledge, spot what could be made faster).
- Perform a dynamic analysis of your code. You can do this with tools; I would recommend Valgrind. It is widely used, free, has many different modules that help you inspect different aspects of your program, and is supported on many platforms. I have used it and I think it is good.
- After these two analyses, look for the hot spots in your program, i.e. the parts that take most of the execution time.
- Try to parallelize those points. As I said, the structure doesn't have to stay the same. Since your procedure1 and procedure2 are independent of each other, streams are one option; see the sketch after this list.
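Regarding the streams idea from the question: since procedure1 and procedure2 are independent, the host can issue their kernels on two different streams instead of trying to launch one kernel from inside another. Below is a minimal sketch with hypothetical kernel names, bodies, and launch configurations standing in for the kernels you already wrote. Note that kernels only actually run concurrently on Compute Capability 2.0+; on a CC 1.1 device they will serialize, but the code is still valid:

// Host-side sketch: issue the two independent procedures on separate streams.
// procedure1_kernel/procedure2_kernel are placeholders for the real kernels.
#include <cuda_runtime.h>

__global__ void procedure1_kernel(const float *M1, float *Amatrix, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) Amatrix[i] = 2.0f * M1[i]; // placeholder computation
}

__global__ void procedure2_kernel(const float *M1, float *Bmatrix, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) Bmatrix[i] = M1[i] + 1.0f; // placeholder computation
}

void run_procedures(const float *dM1, float *dA, float *dB, int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Independent work goes on independent streams.
    procedure1_kernel<<<grid, block, 0, s1>>>(dM1, dA, n);
    procedure2_kernel<<<grid, block, 0, s2>>>(dM1, dB, n);

    // Both matrices must be ready before the dependent z/e loops run.
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}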
Note #1: Since you are a newcomer, the first two readings are mandatory; otherwise you will spend a lot of time debugging. Note #2: If you don't find any hot spots in your program, I would highly doubt that CUDA could speed up your code. But that would be an extreme case, I would say.
Source: https://stackoverflow.com/questions/17476622/porting-a-program-to-cuda-kernel-inside-another-kernel