CUDA: out of memory (threads and blocks issue), address is out of bounds

天命终不由人 2021-01-28 16:39

I am using 63 registers per thread, so (with a maximum of 32768 registers) I can use about 520 threads. I am using 512 threads in this example.

(The parallelism is in the function \"comp

2 answers
  •  盖世英雄少女心
    2021-01-28 17:24

    When I use numPointsRp>2000 it shows me "out of memory"

    Now we have some real code to work with, let's compile it and see what happens. Using RowRsSize=2000 and RowRpSize=200 and compiling with the CUDA 4.2 toolchain, I get:

    nvcc -arch=sm_21 -Xcompiler="-D RowRsSize=2000 -D RowRpSize=200" -Xptxas="-v" -c -I./ kivekset.cu 
    ptxas info    : Compiling entry function '_Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_' for 'sm_21'
    ptxas info    : Function properties for _Z15computeEHfieldsPfiS_iPN6pycuda7complexIfEES3_S2_S2_PA3_S2_S5_S3_
        122432 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info    : Used 57 registers, 84 bytes cmem[0], 168 bytes cmem[2], 76 bytes cmem[16]
    

    The key numbers are 57 registers and a 122432-byte stack frame per thread. The occupancy calculator suggests that a block of 512 threads will have a maximum of 1 block per SM, and your GPU has 7 SMs. This gives a total of 122432 * 512 * 7 = 438796288 bytes of stack frame (local memory) to run your kernel, before you have allocated a single byte of memory for input and output using pyCUDA. On a GPU with 1 GB of memory, it isn't hard to imagine running out of memory. Your kernel has an enormous local memory footprint. Start thinking about ways to reduce it.
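As a quick check of the arithmetic above, here is a minimal host-side sketch that recomputes the stack frame total from the quoted figures. The function name `stack_footprint_bytes` and its parameters are made up for illustration. The formula only reproduces the estimate in this answer, and the amount of local memory the driver actually reserves may differ.

```cpp
#include <cstdint>

// Estimated local memory needed for stack frames, in bytes.
// Hypothetical helper: it multiplies the per-thread frame size by the
// number of threads assumed to be resident on the whole device.
constexpr std::uint64_t stack_footprint_bytes(std::uint64_t frame_bytes_per_thread,
                                              std::uint64_t threads_per_block,
                                              std::uint64_t resident_blocks_per_sm,
                                              std::uint64_t sm_count)
{
    return frame_bytes_per_thread * threads_per_block
         * resident_blocks_per_sm * sm_count;
}

// Figures quoted above: 122432-byte frame, 512 threads per block,
// 1 resident block per SM, 7 SMs.
static_assert(stack_footprint_bytes(122432, 512, 1, 7) == 438796288ULL,
              "matches the total quoted in the answer");
```

That is roughly 418 MiB, a large share of a 1 GB card, before any input or output buffers are allocated.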


    As I indicated in comments, it is absolutely unclear why every thread needs a complete copy of the input data in this kernel code. It results in a gigantic local memory footprint and there seems to be absolutely no reason why the code should be written in this way. You could, I suspect, modify the kernel to something like this:

    typedef pycuda::complex<float> cmplx;
    typedef float fp3[3];
    typedef cmplx cp3[3];
    
    __global__  
    void computeEHfields2(
            float *Rs_mat_, int numPointsRs,
            float *Rp_mat_, int numPointsRp,
            cmplx *J_,
            cmplx *M_,
            cmplx  kp, 
            cmplx  eta,
            cmplx E[][3],
            cmplx H[][3], 
            cmplx *All )
    {
    
        fp3 * Rs_mat = (fp3 *)Rs_mat_;
        cp3 * J = (cp3 *)J_;
        cp3 * M = (cp3 *)M_;
    
        int k=threadIdx.x+blockIdx.x*blockDim.x;
        while (k

    and the main __device__ function it calls to something like this:

    __device__ void computeEvec2(
            fp3 Rs_mat[], int numPointsRs,   
            cp3 J[],
            cp3 M[],
            fp3   Rp,
            cmplx kp, 
            cmplx eta,
            cmplx *Evec,
            cmplx *Hvec, 
            cmplx *All)
    {
     ....
    }
    

    and eliminate every byte of thread local memory without changing the functionality of the computational code at all.
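The idea behind the casts in the kernel above can be sketched in plain host-side C++ (the same code is valid in a `.cu` file). Casting the flat `float *` buffer to a pointer to `fp3` gives a row-indexed view of the existing data, so no per-thread copy is needed. The helper `sum_x` is hypothetical and exists only to show the access pattern; it is not part of the original kernel.

```cpp
#include <cstddef>

typedef float fp3[3];

// Hypothetical helper: sums the x-component of every point by reading
// the flat buffer through a row view. No data is copied.
inline float sum_x(const float* flat, std::size_t num_points)
{
    // View the flat array as num_points rows of 3 floats each.
    const fp3* rows = reinterpret_cast<const fp3*>(flat);

    float total = 0.0f;
    for (std::size_t i = 0; i < num_points; ++i)
        total += rows[i][0];   // x-component of point i
    return total;
}
```

In the kernel the same view is taken over global memory, so each thread indexes into the shared input arrays instead of holding its own copy in local memory.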
