I am using 63 registers/thread, so (32768 being the maximum) I can use about 520 threads. I am using 512 threads in this example.
(The parallelism is in the function "comp
Using R=1000 with
block=(R/2, 1, 1) and grid=(1, 1), everything is OK.
If I try R=10000 with
block=(R/20, 1, 1) and grid=(20, 1), it shows me "out of memory".
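To make the setup concrete, here is a minimal sketch of the launch configuration (the kernel body and its argument are placeholders, not the real kernel; the real one is much larger and is what uses the 63 registers/thread):

    import numpy as np
    import pycuda.autoinit          # creates a context on the default device
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    R = 1000

    # Placeholder kernel standing in for the real one.
    mod = SourceModule("""
    __global__ void comp(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }
    """)
    comp = mod.get_function("comp")

    data = np.zeros(R, dtype=np.float32)

    # Works: 500 threads in a single block.
    comp(drv.InOut(data), block=(R // 2, 1, 1), grid=(1, 1))

    # Fails with "out of memory" in my case: R = 10000, 500 threads/block, 20 blocks.
    # comp(drv.InOut(data), block=(R // 20, 1, 1), grid=(20, 1))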
I'm not familiar with pycuda and didn't read into your code too deeply. However, you have more blocks and more threads, so the kernel will consume more of one of the following (a quick check is sketched after this list):
local memory (probably the kernel's stack, it's allocated per thread),
shared memory (allocated per block), or
global memory that gets allocated based on the grid size (gridDim).
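To narrow down which one it is, you could compare the free device memory right before the working launch and the failing launch. A small sketch, assuming pycuda exposes the driver's memory query (I believe pycuda.driver.mem_get_info() does):

    import pycuda.autoinit
    import pycuda.driver as drv

    # (free, total) in bytes for the current context's device.
    free, total = drv.mem_get_info()
    print("free: %d MiB / total: %d MiB" % (free // 2**20, total // 2**20))

If free memory is already low before the launch, the problem is your own global allocations; if it only drops once you scale up the grid, it is the per-thread/per-block memory that the launch itself reserves.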
You can reduce the stack size by calling
cudaDeviceSetLimit(cudaLimitStackSize, N);
(the code is for the C runtime API, but the pycuda equivalent shouldn't be too hard to find).
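My untested guess at the pycuda equivalent (Context.set_limit and the limit enum are what I'd look for in pycuda.driver; treat the names and the value of N as assumptions to verify against your pycuda version):

    import pycuda.autoinit
    import pycuda.driver as drv

    # Shrink the per-thread stack before launching the kernel.
    # N is a placeholder; use whatever stack size your kernel actually needs.
    N = 1024
    drv.Context.set_limit(drv.limit.STACK_SIZE, N)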