Let's cover the ground already covered in this answer as one possible approach. The overall concepts are as follows: based on parent_kernel<<<a, 1>>>(...), we have a total of a sequences of child kernel launches to perform. Each of those a items consists of child_kernel_1 followed by child_kernel_2. There is a large amount of temporary data that needs to be passed from child 1 to child 2, and we do not wish to preallocate all a instances of such temporary data.
We observe that for each SM in our GPU there is a maximum number of resident blocks X possible; this is a CUDA hardware limit that is queryable at runtime (e.g. via the deviceQuery sample code). Let's suppose we have W of these SMs in our GPU (also queryable at runtime), and that for each SM the hardware limit on resident blocks is X. This means that we should only need to provide W*X temporary allocations, and if W*X is less than a, we may have an avenue to conduct this problem with a reduced temporary allocation size. (To implement this step correctly, X may need to be reduced based on an occupancy analysis of the kernel in question.)
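For reference, a minimal sketch of those runtime queries (cudaDevAttrMaxBlocksPerMultiprocessor requires CUDA 11 or newer; on older toolkits the deviceQuery sample or the hardware tables in the programming guide give the same number):
#include <iostream>

int main(){
  int dev = 0, W = 0, X = 0;
  cudaDeviceGetAttribute(&W, cudaDevAttrMultiProcessorCount, dev);        // W: number of SMs
  cudaDeviceGetAttribute(&X, cudaDevAttrMaxBlocksPerMultiprocessor, dev); // X: hardware limit on resident blocks per SM
  std::cout << "W = " << W << ", X = " << X << ", temporary slots needed: W*X = " << W*X << std::endl;
  return 0;
}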
In order to use this avenue, we will need to limit the total number of blocks we launch so that there are only X per SM, i.e. we must launch W*X blocks. Since this is (presumably) less than a, we must recraft the parent kernel design from:
child_kernel_1<<<c1, b1>>>(... + offset(blockIdx.x), blockIdx.x)
child_kernel_2<<<c2, b2>>>(... + offset(blockIdx.x), blockIdx.x)
to:
for (int i = 0; i < a/(W*X); i++){
    child_kernel_1<<<c1, b1>>>(i, ... + offset(blockIdx.x), blockIdx.x)
    child_kernel_2<<<c2, b2>>>(i, ... + offset(blockIdx.x), blockIdx.x)}
(This assumes for simplicity that a is whole-number divisible by W*X, but that can easily be addressed with a limit check, as sketched below.) (Also note that limiting the total number of blocks isn't absolutely necessary, but it considerably simplifies the per-SM allocation scheme. See below for an alternate method.)
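As an aside, here is a hedged sketch of what that limit check might look like. The child kernels, sizes, and parameter names are illustrative stand-ins, not the actual kernels from the question; compile with relocatable device code (e.g. nvcc -rdc=true -lcudadevrt):
#include <cstdio>
#include <cassert>

// stand-in children: child 1 produces temporary data, child 2 consumes it
__global__ void child_kernel_1(int item, int *tmp){ if (threadIdx.x == 0) tmp[0] = item; }
__global__ void child_kernel_2(int item, int *tmp){ if (threadIdx.x == 0) assert(tmp[0] == item); }

__global__ void parent_kernel(int a, int *tmp_pool, int tmp_per_slot){
  // gridDim.x is assumed to be W*X (or the occupancy-reduced block count)
  int per_block = (a + gridDim.x - 1) / gridDim.x;      // ceiling division instead of a/(W*X)
  int *my_tmp = tmp_pool + blockIdx.x * tmp_per_slot;   // this block's private temporary slot
  for (int i = 0; i < per_block; i++){
    int item = blockIdx.x * per_block + i;
    if (item < a){                                      // the limit check for the tail
      // back-to-back launches by one thread into its default stream run in order,
      // so child 2 sees child 1's output in my_tmp
      child_kernel_1<<<1, 32>>>(item, my_tmp);
      child_kernel_2<<<1, 32>>>(item, my_tmp);}}
}

int main(){
  const int blocks = 80*32;        // placeholder for W*X, hard coded for Tesla V100; query/compute as described above
  const int a = 4*blocks - 100;    // deliberately not divisible by the block count
  const int tmp_per_slot = 256;
  // enlarge the pending-launch buffer so all outstanding child launches fit
  cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 32768);
  int *tmp_pool;
  cudaMalloc(&tmp_pool, (size_t)blocks*tmp_per_slot*sizeof(int));
  parent_kernel<<<blocks, 1>>>(a, tmp_pool, tmp_per_slot);
  cudaDeviceSynchronize();
  printf("%s\n", cudaGetLastError() == cudaSuccess ? "success" : "failure");
  return 0;
}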
The GPU block distributor will distribute the blocks to the SMs, eventually reaching a full load of X blocks per SM. As each block becomes resident on an SM, it begins to execute code and does two things first: A. determine which SM I am on; let's call this w, where w can take on values from 0 to W-1. B. determine which block number I am on this SM. This is done with a simple atomicAdd on a per-SM counter and is not a lock in any sense of that term that I am aware of. Let's call the number returned by this atomicAdd operation x, where x can range from 0 to X-1.
Each block now has a unique (w,x) ordered pair. With this ordered pair it can select from a set of W*X preallocated temporary storage areas. Each block is now performing a/(W*X) sequences of child 1 and child 2 launches, and reuses its already selected temporary storage allocation for each 1-2 sequence.
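Here is a hedged sketch of steps A and B and the slot selection. The counter array, slot sizes, and the trivial per-slot write are illustrative stand-ins, and the child-launch loop is omitted; the clock64() spin just keeps every block resident so that, as described above, each SM ends up holding X blocks and x stays in the 0 to X-1 range:
#include <iostream>

__device__ unsigned get_smid(){
  unsigned smid;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
  return smid;}

__global__ void parent_kernel(int *block_counters, int *temp_pool, int slot_elems, int X){
  int w = get_smid();                                    // A: which SM am I resident on (0..W-1)
  int x = atomicAdd(block_counters + w, 1);              // B: my ordinal among this SM's resident blocks
  int *my_temp = temp_pool + (w * X + x) * slot_elems;   // select one of the W*X preallocated slots
  my_temp[0] = w * X + x;                                // stand-in for the a/(W*X) child 1 / child 2
                                                         // sequences, each of which would reuse my_temp
  long long start = clock64();                           // stay resident so all W*X blocks coexist
  while (clock64() < start + 100000);
}

int main(){
  int dev = 0, W = 0, X = 0;
  cudaDeviceGetAttribute(&W, cudaDevAttrMultiProcessorCount, dev);
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&X, parent_kernel, 1, 0); // occupancy-limited resident blocks per SM
  const int slot_elems = 256;
  int *d_counters, *d_pool;
  cudaMalloc(&d_counters, W*sizeof(int));
  cudaMemset(d_counters, 0, W*sizeof(int));
  cudaMalloc(&d_pool, (size_t)W*X*slot_elems*sizeof(int));
  parent_kernel<<<W*X, 1>>>(d_counters, d_pool, slot_elems, X);
  cudaDeviceSynchronize();
  std::cout << (cudaGetLastError() == cudaSuccess ? "success" : "failure") << std::endl;
  return 0;
}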
Much of the code needed to realize the above is already in the other answer here (method 3).
We can also relax the constraint in the method above, where we restricted the total number of blocks launched to the number that can be simultaneously co-resident. Instead, we can keep track of all slots and make them available as blocks launch and retire. For this case, each block must signal when it is finished. Here is a lightly tested example:
#include <iostream>
#include <cassert>
const long long DELAY_T = 100000;
// this is used to get one of a set of unique slots on the SM
//const unsigned long long slots = 0xFFFFFFFFULL; // 0xFFFFFFFF assumes 32 unique slots per SM
const int max_num_slots = 32;
// sentinel with bit 32 set, so it can never equal any valid 32-slot bitmask
const unsigned long long busy = 0x1FFFFFFFFULL;

__device__ int get_slot(unsigned long long *sm_slots){
  unsigned long long my_slots;
  bool done = false;
  int my_slot;
  while (!done){
    while ((my_slots=atomicExch(sm_slots, busy)) == busy); // wait until we get an available slot
    my_slot = __ffsll(~my_slots) - 1;
    if (my_slot < max_num_slots) done = true;
    else atomicExch(sm_slots, my_slots);} // handle case where all slots busy, should not happen
  unsigned long long my_slot_bit = 1ULL<<my_slot;
  unsigned long long retval = my_slots|my_slot_bit;
  assert(atomicExch(sm_slots, retval) == busy);
  return my_slot;
}

__device__ void release_slot(unsigned long long *sm_slots, int slot){
  unsigned long long my_slots;
  while ((my_slots=atomicExch(sm_slots, busy)) == busy); // wait until slot access not busy
  unsigned long long my_slot_bit = 1ULL<<slot;
  unsigned long long retval = my_slots^my_slot_bit;
  assert(atomicExch(sm_slots, retval) == busy);
}

__device__ int __mysmid(){
  int smid;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
  return smid;}

__global__ void k(unsigned long long *sm_slots, int *temp_data){
  int my_sm = __mysmid();
  int my_slot = get_slot(sm_slots+my_sm);
  // write to my slot's area, wait a while, then verify no other block overwrote it
  temp_data[my_sm*max_num_slots + my_slot] = blockIdx.x;
  long long start = clock64();
  while (clock64()<start+DELAY_T);
  assert(temp_data[my_sm*max_num_slots + my_slot] == blockIdx.x);
  release_slot(sm_slots+my_sm, my_slot);
}

int main(){
  // hard coding constants for Tesla V100 for demonstration purposes.
  // these should instead be queried at runtime to match your GPU
  const int num_sms = 80;
  const int blocks_per_sm = 32;
  // slots must match the number of blocks per SM, constants at top may need to be modified
  assert(blocks_per_sm <= max_num_slots);
  unsigned long long *d_sm_slots;
  int *d_data;
  cudaMalloc(&d_sm_slots, num_sms*blocks_per_sm*sizeof(unsigned long long));
  cudaMalloc(&d_data, num_sms*blocks_per_sm*sizeof(int));
  cudaMemset(d_sm_slots, 0, num_sms*blocks_per_sm*sizeof(unsigned long long));
  k<<<123456, 1>>>(d_sm_slots, d_data);
  cudaDeviceSynchronize();
  if (cudaGetLastError()!=cudaSuccess) {std::cout << "failure" << std::endl; return 0;}
  std::cout << "success" << std::endl;
  return 0;
}
In the first method, I referred to launching W*X blocks, but to use that method correctly an occupancy analysis would be necessary, and the number W*X may need to be reduced based on it. The second method, as indicated in the code sample, works correctly for an arbitrary number of blocks launched.