cuda — out of memory (threads and blocks issue) --Address is out of bounds

天命终不由人 2021-01-28 16:39

I am using 63 registers per thread, so (since 32768 registers is the maximum) I can use about 520 threads. I am currently using 512 threads in this example.

(The parallelism is in the function "comp

2 Answers
  • 2021-01-28 17:08

Using R=1000 and then

block=R/2,1,1 and grid=1,1, everything is OK.

If I try R=10000 and

block=R/20,1,1 and grid=20,1, then it shows me "out of memory".

I'm not familiar with pycuda and didn't read into your code too deeply. However, you have more blocks and more threads, so the launch will consume more of:

    • local memory (probably the kernel's stack, it's allocated per thread),

    • shared memory (allocated per block), or

    • global memory that gets allocated based on grid or gridDim.
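To see why the second launch runs out of memory while the first does not, it helps to count threads: local memory is allocated per resident thread, so demand scales with the total thread count. A quick sketch in plain Python, with the figures taken from the two configurations quoted above:

```python
# Sketch: how total thread count (and hence per-thread local memory
# demand) scales between the two launch configurations.
def total_threads(block, grid):
    """Threads launched = threads per block * number of blocks."""
    bx, by, bz = block
    gx, gy = grid
    return bx * by * bz * gx * gy

R1 = 1000
ok_config = total_threads((R1 // 2, 1, 1), (1, 1))     # 500 threads

R2 = 10000
oom_config = total_threads((R2 // 20, 1, 1), (20, 1))  # 10000 threads

print(ok_config, oom_config)    # 500 10000
print(oom_config // ok_config)  # 20x more threads, so roughly 20x the
                                # per-thread local-memory demand
```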

You can reduce the stack size by calling

    cudaDeviceSetLimit(cudaLimitStackSize, N);
    

    (the code is for the C runtime API, but the pycuda equivalent shouldn't be too hard to find).
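For reference, a minimal pycuda sketch of that same call (assuming a pycuda recent enough to expose `Context.set_limit` and the `driver.limit` enum, and a CUDA 3.1+ device; the 4 KiB value is an assumption you would tune to your kernel's actual stack frame):

```python
import pycuda.autoinit        # creates a context on the default device
import pycuda.driver as drv

# Cap the per-thread stack (value here is illustrative, not a recommendation).
drv.Context.set_limit(drv.limit.STACK_SIZE, 4096)

print(drv.Context.get_limit(drv.limit.STACK_SIZE))
```

This needs a CUDA-capable GPU to run, so treat it as a sketch rather than a tested recipe.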

  • 2021-01-28 17:24

When I use numPointsRp > 2000 it shows me "out of memory"

Now that we have some real code to work with, let's compile it and see what happens. Using RowRsSize=2000 and RowRpSize=200 and compiling with the CUDA 4.2 toolchain, I get:

    nvcc -arch=sm_21 -Xcompiler="-D RowRsSize=2000 -D RowRpSize=200" -Xptxas="-v" -c -I./ kivekset.cu 
    ptxas info    : Compiling entry function '_Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_' for 'sm_21'
    ptxas info    : Function properties for _Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_
        122432 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info    : Used 57 registers, 84 bytes cmem[0], 168 bytes cmem[2], 76 bytes cmem[16]
    

The key numbers are 57 registers and a 122432-byte stack frame per thread. The occupancy calculator suggests that a block of 512 threads will have a maximum of 1 block per SM, and your GPU has 7 SMs. This gives a total of 122432 * 512 * 7 = 438796288 bytes of stack frame (local memory) to run your kernel, before you have allocated a single byte of memory for input and output using pyCUDA. On a GPU with 1 GB of memory, it isn't hard to imagine running out of memory. Your kernel has an enormous local memory footprint. Start thinking about ways to reduce it.
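The arithmetic behind that total, spelled out in plain Python (the per-thread stack frame comes from the ptxas output above; block count per SM and SM count from the occupancy calculator and the GPU in question):

```python
stack_frame_bytes = 122432   # per thread, from the ptxas output
threads_per_block = 512
blocks_per_sm = 1            # occupancy-limited
num_sms = 7

total = stack_frame_bytes * threads_per_block * blocks_per_sm * num_sms
print(total)                 # 438796288
print(round(total / 2**20))  # ~418 MiB of local memory, before any
                             # pyCUDA input/output allocations
```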


    As I indicated in comments, it is absolutely unclear why every thread needs a complete copy of the input data in this kernel code. It results in a gigantic local memory footprint and there seems to be absolutely no reason why the code should be written in this way. You could, I suspect, modify the kernel to something like this:

    typedef  pycuda::complex<float> cmplx;
    typedef float fp3[3];
    typedef cmplx cp3[3];
    
    __global__  
    void computeEHfields2(
            float *Rs_mat_, int numPointsRs,
            float *Rp_mat_, int numPointsRp,
            cmplx *J_,
            cmplx *M_,
            cmplx  kp, 
            cmplx  eta,
            cmplx E[][3],
            cmplx H[][3], 
            cmplx *All )
    {
    
        fp3 * Rs_mat = (fp3 *)Rs_mat_;
        cp3 * J = (cp3 *)J_;
        cp3 * M = (cp3 *)M_;
    
        int k=threadIdx.x+blockIdx.x*blockDim.x;
        while (k<numPointsRp)  
        {
        fp3 * Rp_mat = ((fp3 *)Rp_mat_) + k;   /* index by point: 3 floats per point */
            computeEvec2( Rs_mat, numPointsRs, J, M, *Rp_mat, kp, eta, E[k], H[k], All );
            k+=blockDim.x*gridDim.x;
        }
    }
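The `while` loop in the kernel above is a grid-stride loop: a fixed launch configuration covers any `numPointsRp`, with each thread striding forward by the total thread count. A quick coverage check in plain Python (the thread/block counts here are illustrative, not taken from the question):

```python
def grid_stride_indices(thread_id, block_id, block_dim, grid_dim, n):
    """Indices one thread processes, mirroring the kernel's while loop."""
    k = thread_id + block_id * block_dim
    out = []
    while k < n:
        out.append(k)
        k += block_dim * grid_dim
    return out

block_dim, grid_dim, n = 256, 4, 3000   # illustrative launch, 3000 points

covered = sorted(
    i
    for b in range(grid_dim)
    for t in range(block_dim)
    for i in grid_stride_indices(t, b, block_dim, grid_dim, n)
)
assert covered == list(range(n))   # every point handled exactly once
```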
    

    and the main __device__ function it calls to something like this:

    __device__ void computeEvec2(
            fp3 Rs_mat[], int numPointsRs,   
            cp3 J[],
            cp3 M[],
            fp3   Rp,
            cmplx kp, 
            cmplx eta,
            cmplx *Evec,
            cmplx *Hvec, 
            cmplx *All)
    {
     ....
    }
    

    and eliminate every byte of thread local memory without changing the functionality of the computational code at all.
