问题
I recently encountered a simple notion in programming but i stuck when i tried to implement it in cuda.
Suppose that i have thousands of elements and i want to find the closest pair between them. I use atomicMIN
in global memory (suppose that we dont want to reduce) so if the distance which is calculated by each thread is smaller than the distance stored in the global variable the atomicCAS will replace it with the smaller value. For example i have the global variable float gbl_min_dist
To do this I use the following code:
__device__ inline float atomicMin(float *addr, float value){
float old = *addr, assumed;
if( old <= value ) return old;
do{
assumed = old;
old = atomicCAS((unsigned int*)addr, __float_as_int(assumed), __float_as_int(value));
}while( old!=assumed );
return old;
}
Suppose now that we want to store the index of the two points that were closer together and for which the atomicMIN
has successfully replaced the old minimum distance with the one calculated by those two points. What I mean is that I only want to store the indeces of the two points that currently have the smaller distance if and only if its distance has just been successfully swaped in the global variable
typedef struct {float gbl_min_dist,
unsigned int point1,
unsigned int point2;} global_closest_points;
So here, when a thread executes the atomicMIN
, if the value that is proposed by that tread to be compared is swapped in the gbl_min_dist
then i also need to swap the p1, p2 with the values from the thread. If the gbl_min_dist
is not swapped then I dont want to store the points cause this would give wrong points but correct minimum distance.
Is there any return value to check if atomicCAS
has made the swap?
Any ideas on how to implement this within the atomicMIN
?
Thanks in advance
回答1:
- You could use a critical section to have each thread have exclusive access to the data while it is updating it.
- Since your
gbl_min_dist
is a 32-bit value, if you can figure out a way to squeeze bothp1
andp2
into a single 32-bit value, you could use an approach like the custom atomics answer I gave here.
If you simply use whether or not the atomicCAS
made the first swap to condition additional code to update p1
and p2
, I think it's still possible to have a race condition that allows your data to get out of sync between thread updates.
回答2:
You could construct a critical section to atomically update the min value and corresponding point indices. The following link gives a example on how to build the CS with atomicCAS()
and atomicExch()
.
https://github.com/ArchaeaSoftware/cudahandbook/blob/master/memory/spinlockReduction.cu
On the other hand, I would suggest replace the atomic min operations by parallel reduction algorithm. That may improve the performance.
回答3:
The way I suggest is, rather than depend on a stored distance, recompute it any time it is critical that the stored points may have changed:
typedef struct {
unsigned int point1,
unsigned int point2;
}
global_closest_points, local_closest_points, temp_c_p;
local_dist = distance(local_closest_points.point1, local_closest_points.point2);
temp_c_p = global_closest_points;
while (local_dist < distance(temp_c_p.point1, temp_c_p.point2)
temp_c_p = atomicCAS(&global_closest_points, temp_c_p, local_closest_points);
Old habits were, save rather than recompute. But with modern processors, that's often not optimal. On CUDA, an atomic update to global memory takes more time than computing hundreds of double-precision distances.
来源:https://stackoverflow.com/questions/19363066/how-to-use-atomiccas-for-multiple-variables-with-conditionals-in-cuda