driver.Context.synchronize() - what else to take into consideration? - a clean-up operation failed

最后都变了 - Submitted on 2020-01-13 10:45:12

Question


I have this code (modified following the answer below).

Info

32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 120 bytes cmem[0], 176 bytes cmem[2], 76 bytes cmem[16]

I don't know what else to take into consideration in order to make it work for different combinations of points "numPointsRs" and "numPointsRp".

When, for example, I run the code with Rs=10000 and Rp=100000 with block=(128,1,1), grid=(200,1), it works fine.

My computations:

46 registers * 128 threads = 5888 registers.

My card has a limit of 32768 registers, so 32768/5888 = 5 + some => 5 blocks/SM
(my card has a limit of 6).

With the occupancy calculator I found that using 128 threads/block gives me 42% occupancy, and I am within the limits of my card.

Also, the number of threads per MP is 640 (the limit is 1536).
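
In code form, the arithmetic I am doing is roughly:

regs_per_thread    = 46
threads_per_block  = 128
regs_per_sm        = 32768     # register file per SM on my card
max_threads_per_sm = 1536      # CC 2.x limit

regs_per_block = regs_per_thread * threads_per_block      # 5888
blocks_per_sm  = regs_per_sm // regs_per_block             # 5 ("5 + some")
threads_per_sm = blocks_per_sm * threads_per_block         # 640
print(round(100.0 * threads_per_sm / max_threads_per_sm))  # 42 (%)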

Now, if I try to use Rs=100000 and Rp=100000 (with the same threads and blocks), it gives me the message in the title, with:

cuEventDestroy failed: launch timeout

cuModuleUnload failed: launch timeout

1) I don't know/understand what else needs to be computed.

2) I can't understand how we use/find the number of blocks. I can see that mostly someone puts (threads-1+points)/threads, but that still doesn't work.
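
To be concrete, the formula I mean is the usual ceil-division block count (one thread per output point):

threads_per_block = 128
numPointsRp = 100000                                               # example size
num_blocks = (numPointsRp + threads_per_block - 1) // threads_per_block
print(num_blocks)                                                  # 782

With the grid-stride loop in the kernel (k += blockDim.x*gridDim.x), a smaller fixed grid such as (200,1) should also cover all points, just with more iterations per thread.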

--------------UPDATED----------------------------------------------

After using driver.Context.synchronize(), the code works for many points (1,000,000)!

But what impact does this addition have on the code? (For many points the screen freezes for 1 minute or more.) Should I use it or not?
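
For reference, this is where the call sits relative to the kernel launch (same names as in the snapshot below):

func(Rs_gpu, np.int32(numPointsRs), Rp_gpu, np.int32(numPointsRp),
     J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),
     Evec_gpu, Hvec_gpu, All_gpu,
     block=(128, 1, 1), grid=(200, 1))

# Block the host until the kernel has finished (any launch error surfaces here)
drv.Context.synchronize()

Evec = Evec_gpu.get()   # the .get() calls also wait for the kernel to finish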

--------------UPDATED2----------------------------------------------

Now the code doesn't work again, without my having changed anything!

Snapshot of code:

import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import cmath
import pycuda.driver as drv
import pycuda.tools as t






#---- Initialization and passing(allocate memory and transfer data) to GPU -------------------------
Rs_gpu=gpuarray.to_gpu(Rs)
Rp_gpu=gpuarray.to_gpu(Rp)

J_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
M_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))

Evec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
Hvec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
All_gpu=gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))

#-----------------------------------------------------------------------------------    
mod =SourceModule("""
#include <pycuda-complex.hpp>
#include <cmath>
#include <vector>

typedef  pycuda::complex<float> cmplx;
typedef float fp3[3];
typedef cmplx cp3[3];

__device__ __constant__ float Pi;

extern "C"{  


    __device__ void computeEvec(fp3 Rs_mat[], int numPointsRs,   
         cp3 J[],
         cp3 M[],
         fp3 Rp,
         cmplx kp, 
         cmplx eta,
         cmplx *Evec,
         cmplx *Hvec, cmplx *All)

{
            int c = 0;
            while (c < numPointsRs){

                ...
                c++;

            }
        }


__global__  void computeEHfields(float *Rs_mat_, int numPointsRs,     
        float *Rp_mat_, int numPointsRp,     
    cmplx *J_,
    cmplx *M_,
    cmplx  kp, 
    cmplx  eta,
    cmplx E[][3],
    cmplx H[][3], cmplx *All )
    {

        fp3 * Rs_mat=(fp3 *)Rs_mat_;
        fp3 * Rp_mat=(fp3 *)Rp_mat_;
        cp3 * J=(cp3 *)J_;
        cp3 * M=(cp3 *)M_;


    int k=threadIdx.x+blockIdx.x*blockDim.x;

      while (k<numPointsRp)  
     {

        computeEvec( Rs_mat, numPointsRs,  J, M, Rp_mat[k], kp, eta, E[k], H[k], All );
        k+=blockDim.x*gridDim.x;

    }

}
}

""" ,no_extern_c=1,options=['--ptxas-options=-v'])


#call the function(kernel)
func = mod.get_function("computeEHfields")

func(Rs_gpu,np.int32(numPointsRs),Rp_gpu,np.int32(numPointsRp),J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),Evec_gpu,Hvec_gpu, All_gpu, block=(128,1,1),grid=(200,1))


#----- get data back from GPU-----
Rs=Rs_gpu.get()
Rp=Rp_gpu.get()
J=J_gpu.get()
M=M_gpu.get()
Evec=Evec_gpu.get()
Hvec=Hvec_gpu.get()
All=All_gpu.get()

My card:

Device 0: "GeForce GTX 560"
  CUDA Driver Version / Runtime Version          4.20 / 4.10
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 1024 MBytes (1073283072 bytes)
  ( 0) Multiprocessors x (48) CUDA Cores/MP:     0 CUDA Cores   // deviceQuery prints 0 here, but the GTX 560 has 7 MPs x 48 cores/MP = 336 CUDA cores

Answer 1:


There are quite a few issues that you have to deal with. The answer provided by @njuffa (Answer 2 below) is the best general solution. I'll provide more feedback based upon the limited data you have provided.

  1. PTX output of 46 registers is not the number of registers used by your kernel. PTX is an intermediate representation. The offline or JIT compiler will convert this to device code. Device code may use more or less registers. Nsight Visual Studio Edition, the Visual Profiler, and the CUDA command line profiler can all provide you the correct register count.

  2. The occupancy calculation is not simply RegistersPerSM / RegistersPerThread. Registers are allocated based upon a granularity. For CC 2.1 the granularity is 4 registers per thread per warp (128 registers). 2.x devices can actually allocate at a 2 register granularity but this can lead to fragmentation later in the kernel.
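
In numbers, the granularity rule works out roughly like this (a sketch based on the 128-registers-per-warp figure above):

import math

regs_per_thread = 46
warp_size       = 32
granularity     = 128   # register allocation granularity per warp on CC 2.1

regs_per_warp   = math.ceil(regs_per_thread * warp_size / granularity) * granularity  # 1536
warps_per_block = 128 // warp_size                                                     # 4
regs_per_block  = regs_per_warp * warps_per_block                                      # 6144
print(32768 // regs_per_block)   # still 5 blocks/SM after rounding up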

  3. In your occupancy calculation you state

My card has limit 32768registers,so 32768/5888=5 +some => 5 block/SM (my card has limit 6).

I'm not sure what 6 means. Your device has 7 SMs. The maximum blocks per SM for 2.x devices is 8 blocks per SM.

  4. You have provided an insufficient amount of code. If you provide pieces of code please provide the size of all inputs, the number of times each loop will be executed, and a description of the operations per function. Looking at the code you may be doing too many loops per thread. Without knowing the order of magnitude of the outer loop we can only guess.

  5. Given that the launch is timing out you should probably approach debugging as follows:

a. Add a line to the beginning of the code

if (blockIdx.x > 0) { return; }

Run the exact code you have in one of the previously mentioned profilers to estimate the duration of a single block. Using the launch information provided by the profiler (registers per thread, shared memory, ...), use the occupancy calculator in the profiler or the xls to determine the maximum number of blocks that you can run concurrently. For example, if the theoretical block occupancy is 3 blocks per SM, and the number of SMs is 7, then you can run 21 blocks at a time, which for your launch of 200 blocks is roughly 10 waves. NOTE: this assumes equal work per thread. Change the early exit code to allow 1 wave (21 blocks). If this launch times out then you need to reduce the amount of work per thread. If this passes then calculate how many waves you have and estimate when you will time out (2 sec on Windows, ? on Linux).
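
For instance, with the 3 blocks/SM x 7 SMs figure from that example, the one-wave early exit could look like (21 is just that example number):

if (blockIdx.x >= 21) { return; }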

b. If you have too many waves then you have to reduce the launch configuration. Given that you index by gridDim.x and blockDim.x you can do this by passing in these dimensions as parameters to your kernel. This will require you to minimally change your indexing code. You will also have to pass a blockIdx.x offset. Change your host code to launch multiple kernels back to back. Since there should be no conflict you can launch these in multiple streams to benefit from overlap at the end of each wave. A rough host-side sketch of this follows.
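
As a sketch only (not your actual code), assume the kernel is extended with two hypothetical int parameters, block_offset and total_blocks, so that inside the kernel the index becomes k = threadIdx.x + (blockIdx.x + block_offset)*blockDim.x and the stride becomes blockDim.x*total_blocks. The host side, reusing the arrays and func from your snapshot, could then look like:

import numpy as np
import pycuda.driver as drv

total_blocks     = 200   # the logical grid from the original launch
blocks_per_chunk = 21    # one "wave" per launch, from the example above

for offset in range(0, total_blocks, blocks_per_chunk):
    chunk = min(blocks_per_chunk, total_blocks - offset)
    func(Rs_gpu, np.int32(numPointsRs), Rp_gpu, np.int32(numPointsRp),
         J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),
         Evec_gpu, Hvec_gpu, All_gpu,
         np.int32(offset), np.int32(total_blocks),   # hypothetical extra parameters
         block=(128, 1, 1), grid=(chunk, 1))
    # Each launch is short enough to stay under the watchdog; synchronizing
    # between launches also lets the display update. Streams could be used
    # instead, to overlap the tail of each wave as described above.
    drv.Context.synchronize()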




Answer 2:


"launch timeout" would appear to indicate that the kernel ran too long and was killed by the watchdog timer. This can happen on GPUs that are also used for graphics output (e.g. a graphical desktop), where the task of the watchdog timer is to prevent the desktop from locking up for more than a few seconds. Best I can recall the watchdog time limit is on the order of 5 seconds or thereabouts.

At any given moment, the GPU can either run graphics, or CUDA, so the watchdog timer is needed when running a GUI to prevent the GUI from locking up for an extended period of time, which renders the machine inoperable through the GUI.

If possible, avoid using this GPU for the desktop and/or other graphics (e.g. don't run X if you are on Linux). If running without graphics isn't an option, to reduce kernel execution time to avoid hitting watchdog timer kernel termination, you will have to do less work per kernel launch, optimize the code so the kernel runs faster for the same amount of work, or deploy a faster GPU.
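
If you want to confirm whether the watchdog applies to your device, a small PyCUDA check (a sketch, not part of your code) is:

import pycuda.driver as drv
import pycuda.autoinit   # creates a context on device 0

# KERNEL_EXEC_TIMEOUT is 1 when the driver enforces the watchdog run-time
# limit on this device (typically the GPU that also drives the display).
print(pycuda.autoinit.device.get_attribute(drv.device_attribute.KERNEL_EXEC_TIMEOUT))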




Answer 3:


To add to @njuffa's answer: on Windows systems you can increase the launch timeout, or TDR (Timeout Detection & Recovery), by following these steps:

1: Open the options in Nsight Monitor.

2: Set an appropriate value for WDDM TDR Delay.

CAUTION: If this value is too small you may still get timeout errors, and for higher values your screen will stay frozen until the kernel finishes its job.




Source: https://stackoverflow.com/questions/12263945/driver-context-synchronize-what-else-to-take-into-consideration-a-clean-u
