How to execute atomic write in CUDA?

Question

First of all, I cannot find a reliable source stating whether a write is atomic in CUDA or not. For example, the question "Is global memory write considered atomic in CUDA?" touches on this subject, but its last remark shows that we are not talking about the same notion of atomicity. Consider the code:

global_mem[0] = pick_at_random_from(1, 2);
shared_mem[0] = pick_at_random_from(1, 2);

executed by a gazillion threads. Here "atomic" means that in both cases the content will be 1 or 2, and it is guaranteed that nothing else can show up (like 3). Atomic means integrity.

But as I understand it, CUDA does not guarantee this, so when I run this code I could potentially get the value 3? If that is really the case, how do I perform an atomic write? There is atomicExch, but it is overkill: it does more than is needed.

Atomic functions I already checked: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions

Answer 1:

For a write operation in each of 2 different threads in CUDA, if:

- the writes are to the same location (address)
- that address is naturally aligned for the size of the write
- the size of the write operation is the same between each of the two threads (and is of size 1, 2, 4, or 8 bytes)

then you are guaranteed to get one of the values written by those two threads, and not any other value, considering the data type size that was written. This holds as long as each write is done by a single SASS instruction. The correctness here is provided by current CUDA hardware, not necessarily by the compiler, the CUDA programming model, and/or the C++ standard to which CUDA adheres.

This is directly extendable to any number of threads that meet the above conditions.
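As a rough illustration of the conditions above, the following sketch has many threads store either 1 or 2 to the same naturally aligned 4-byte location and then checks on the host that one of those two values survived. The kernel name `racy_store`, the helper `run_racy_store`, and the launch configuration are made up for this example, and the sketch assumes the plain assignment compiles to a single 32-bit store instruction, which is typical but should be confirmed in the SASS.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread stores either 1 or 2 to the same naturally aligned 4-byte word.
__global__ void racy_store(int *slot)
{
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Assumption: this assignment becomes one 32-bit store in SASS.
    *slot = (tid & 1u) ? 1 : 2;
}

// Hypothetical helper: launches the kernel and returns the value that
// survived, or -1 if any CUDA call fails.
int run_racy_store()
{
    int *d_slot = nullptr;
    int h_slot = 0;
    if (cudaMalloc(&d_slot, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_slot, 0, sizeof(int));

    racy_store<<<256, 256>>>(d_slot);   // 65536 threads, all writing one word
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // cudaMemcpy on the default stream waits for the kernel to finish.
        err = cudaMemcpy(&h_slot, d_slot, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_slot);
    return err == cudaSuccess ? h_slot : -1;
}

int main()
{
    int v = run_racy_store();
    printf("final value: %d\n", v);
    // Which of 1 or 2 wins is not defined; anything else would be a violation.
    return (v == 1 || v == 2) ? 0 : 1;
}
```

A passing run does not prove the guarantee; it only shows what the answer describes for one compiler and one GPU.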

This assumes no other threads are doing "anything else" with respect to the written locations (i.e. they are not writing a different size quantity to that location, or any overlapping location, or of some other alignment).

Which actual value will end up in that location is generally undefined (except that it will be one and only one of the written values, and not anything else) unless the programmer enforces some ordering on the operations.

When writing vector quantities or structures in C/C++, care should be taken to ensure that the underlying write (store) instruction in SASS code references the appropriate size. The statements above refer to writes as issued by the SASS code. Generally speaking, I don't expect much difference between that interpretation and "writes from C/C++ code" using POD data types. But structures could be broken into multiple transactions of a smaller size, in which case the above statements would no longer hold. Nevertheless, with appropriate programming practices (e.g. careful use of vector types) it is possible in C/C++ to ensure that writes of up to 8 bytes are used where relevant.
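One way to apply the vector-type advice is sketched below, using the built-in `int2` type, which is declared with 8-byte alignment. Each thread writes both components with the same value, so a torn write would show up as a pair whose halves differ. The names `store_pair` and `run_store_pair` are made up for this example, and the sketch assumes the compiler emits a single 64-bit store for the `int2` assignment. That is the usual outcome, but it is worth confirming by inspecting the SASS, for instance with `cuobjdump -sass`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// int2 carries 8-byte alignment, so a cudaMalloc'd int2 is naturally aligned
// for one 8-byte store. A plain struct { int a; int b; } is only 4-byte
// aligned and may be written as two separate 32-bit stores.
static_assert(alignof(int2) == 8, "int2 is expected to be 8-byte aligned");

__global__ void store_pair(int2 *slot)
{
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    int v = (tid & 1u) ? 1 : 2;
    // Assumption: this becomes one 64-bit store in SASS.
    *slot = make_int2(v, v);
}

// Hypothetical helper: runs the kernel and copies the surviving pair to *out.
// Returns false if any CUDA call fails.
bool run_store_pair(int2 *out)
{
    int2 *d_slot = nullptr;
    if (cudaMalloc(&d_slot, sizeof(int2)) != cudaSuccess) return false;
    cudaMemset(d_slot, 0, sizeof(int2));

    store_pair<<<256, 256>>>(d_slot);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaMemcpy(out, d_slot, sizeof(int2), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_slot);
    return err == cudaSuccess;
}

int main()
{
    int2 r = make_int2(0, 0);
    if (!run_store_pair(&r)) return 1;
    printf("final pair: (%d, %d)\n", r.x, r.y);
    // A mixed pair such as (1, 2) would mean the write was split.
    bool intact = (r.x == r.y) && (r.x == 1 || r.x == 2);
    return intact ? 0 : 1;
}
```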

Source: https://stackoverflow.com/questions/52848426/how-to-execute-atomic-write-in-cuda

Tags

cuda

atomic