In Gabriel's response:
"Global instruction synchronization is not possible within a kernel. If you need f() done on all threads before calling g() on any thread, split f() and g() into two different kernels and call them serially from the host."
What if the reason you need f() and g() in the same thread is that you're using register memory, and you want register or shared data from f() to carry over into g()? That is, for my problem, the whole reason for synchronizing across blocks is that data from f() is needed in g(), and breaking g() out into a separate kernel would require a large amount of additional global memory just to transfer that register data from f() to g(), which I'd like to avoid.
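To make the concern concrete, here's a toy sketch (again with made-up names, not my real code): in a single kernel the result of f() can sit in a register until g() consumes it, whereas the two-kernel split above forces that same per-thread value through a grid-sized global buffer.

```cuda
// Single kernel: f()'s result stays in a register, no extra global traffic.
__global__ void fg_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r = in[i] * 2.0f;   // f(): per-thread result held in a register
        out[i]  = r + 1.0f;       // g(): reads that register directly
    }
}
// With the split version, this per-thread value would instead need a global
// buffer of one float per thread in the grid, written by f_kernel and
// re-read by g_kernel, which is exactly the extra memory and traffic
// I'm trying to avoid.
```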