States of memory data after cuda exceptions

后端未结

关注

 1  558

闹比i

CUDA document is not clear on how memory data changes after CUDA applications throws an exception.

For example, a kernel launch(dynamic) encountered an exception (e.

相关标签:

1条回答

长发绾君心

2020-12-06 04:22

The behavior is undefined in the event of a CUDA error which corrupts the CUDA context.

This type of error is evident because it is "sticky", meaning once it occurs, every single CUDA API call will return that error, until the context is destroyed.

Non-sticky errors are cleared automatically after they are returned by a cuda API call (with the exception of cudaPeekAtLastError). Any "crashed kernel" type error (invalid access, unspecified launch failure, etc.) will be a sticky error. In your example, step 3 would (always) return an API error on the result of the cudaMemcpy call to transfer variableA from device to host, so the results of the cudaMemcpy operation are undefined and unreliable -- it is as if the cudaMemcpy operation also failed in some unspecified way.

Since the behavior of a corrupted CUDA context is undefined, there is no definition for the contents of any allocations, or in general the state of the machine after such an error.

An example of a non-sticky error might be an attempt to cudaMalloc more data than is available in device memory. Such an operation will return an out-of-memory error, but that error will be cleared after being returned, and subsequent (valid) cuda API calls can complete successfully, without returning an error. A non-sticky error does not corrupt the CUDA context, and the behavior of the cuda context is exactly the same as if the invalid operation had never been requested.

This distinction between sticky and non-sticky error is called out in many of the documented error code descriptions, for example:

non-sticky, non-cuda-context-corrupting:

cudaErrorMemoryAllocation = 2 The API call failed because it was unable to allocate enough memory to perform the requested operation.

sticky, cuda-context-corrupting:

cudaErrorMisalignedAddress = 74 The device encountered a load or store instruction on a memory address which is not aligned. The context cannot be used, so it must be destroyed (and a new one should be created). All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA.

Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.

0 讨论(0)
发布评论:

提交评论
- 加载中...