I have these indexes:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,etc...
Which are indexes of nodes in a matrix (including diagonal elements):
In my case (a CUDA kernel implemented in standard C), I use zero-based indexing (and I want to exclude the diagonal) so I needed to make a few adjustments:
// idx is still one-based
unsigned long int idx = blockIdx.x * blockDim.x + threadIdx.x + 1; // CUDA kernel launch parameters
// but the coordinates are now zero-based
unsigned long int x = ceil(sqrt((2.0 * idx) + 0.25) - 0.5);
unsigned long int y = idx - (x - 1) * x / 2 - 1;
Which results in:
[0]: (1, 0)
[1]: (2, 0)
[2]: (2, 1)
[3]: (3, 0)
[4]: (3, 1)
[5]: (3, 2)
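For readers who want to try the mapping outside a kernel, here is a minimal host-side C++ sketch of the same two formulas. The helper name `index_to_coords` and the use of `std::pair` are assumptions made for illustration; they are not part of the original kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Hypothetical helper (not from the original kernel): maps a one-based linear
// index to zero-based (x, y) coordinates strictly below the diagonal.
std::pair<std::uint64_t, std::uint64_t> index_to_coords(std::uint64_t idx) {
    assert(idx >= 1);  // idx is one-based
    // Row: smallest x with idx <= x * (x + 1) / 2, computed in double precision.
    auto x = static_cast<std::uint64_t>(
        std::ceil(std::sqrt(2.0 * static_cast<double>(idx) + 0.25) - 0.5));
    // Column: offset of idx within row x, shifted to be zero-based.
    std::uint64_t y = idx - (x - 1) * x / 2 - 1;
    return {x, y};
}
```

Calling `index_to_coords(idx)` for idx = 1 through 6 gives the six pairs listed above.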
I also re-derived the formula of Flórez-Rueda and Moreno (2001) and arrived at:
unsigned long int x = floor(sqrt(2.0 * pos + 0.25) + 0.5);
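The floor form can be checked against the ceil form with a small host-side C++ sketch. It assumes that `pos` is the zero-based position (that is, `pos = idx - 1`), which the text above does not state explicitly; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Row from the floor form; pos is ASSUMED to be zero-based (pos = idx - 1).
std::uint64_t row_from_floor(std::uint64_t pos) {
    return static_cast<std::uint64_t>(
        std::floor(std::sqrt(2.0 * static_cast<double>(pos) + 0.25) + 0.5));
}

// Row from the ceil form, with a one-based idx.
std::uint64_t row_from_ceil(std::uint64_t idx) {
    assert(idx >= 1);
    return static_cast<std::uint64_t>(
        std::ceil(std::sqrt(2.0 * static_cast<double>(idx) + 0.25) - 0.5));
}

// Returns true if both forms pick the same row for every pos in [0, limit).
bool forms_agree(std::uint64_t limit) {
    for (std::uint64_t pos = 0; pos < limit; ++pos) {
        if (row_from_floor(pos) != row_from_ceil(pos + 1)) return false;
    }
    return true;
}
```

Under that assumption, the two forms should select the same x for every position.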
CUDA note: I tried everything I could think of to avoid double-precision math, but the single-precision sqrt function in CUDA is simply not precise enough to convert positions greater than 121 million or so to x, y coordinates (when using 1,024 threads per block and indexing along only one block dimension). Some articles apply a "correction" to bump the result in a particular direction, but this inevitably falls apart at some point.
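One way to probe the precision issue on the host is to compare the formula against exact integer arithmetic at the row boundaries, where rounding errors matter most. The sketch below is a host-side C++ approximation, not CUDA code: the template and function names are hypothetical, and the host `std::sqrt` for `float` may not behave exactly like the CUDA single-precision sqrt.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Row computed with floating-point type Real (float or double).
template <typename Real>
std::uint64_t row_with_precision(std::uint64_t idx) {
    assert(idx >= 1);  // idx is one-based
    Real r = std::ceil(
        std::sqrt(Real(2) * static_cast<Real>(idx) + Real(0.25)) - Real(0.5));
    return static_cast<std::uint64_t>(r);
}

// Counts wrong rows at the boundaries of rows 1..max_row, using exact integer
// arithmetic as the reference. With T(i) = i * (i + 1) / 2, the index T(i)
// must map to row i and the index T(i) + 1 must map to row i + 1.
template <typename Real>
std::uint64_t count_boundary_mismatches(std::uint64_t max_row) {
    std::uint64_t mismatches = 0;
    for (std::uint64_t i = 1; i <= max_row; ++i) {
        std::uint64_t last_in_row = i * (i + 1) / 2;
        if (row_with_precision<Real>(last_in_row) != i) ++mismatches;
        if (row_with_precision<Real>(last_in_row + 1) != i + 1) ++mismatches;
    }
    return mismatches;
}
```

Comparing `count_boundary_mismatches<float>(n)` with `count_boundary_mismatches<double>(n)` for a large `n` is one way to see where single precision stops being reliable on a given platform; the exact threshold may differ from the figure quoted above.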
Not optimized at all:
int j = idx;
int i = 1;
while(j > i) {
j -= i++;
}
Optimized:
int i = std::ceil(std::sqrt(2 * idx + 0.25) - 0.5);
int j = idx - (i-1) * i / 2;
And here is the demonstration:
You're looking for i such that:
sumRange(1, i-1) < idx && idx <= sumRange(1, i)
where sumRange(min, max) sums the integers between min and max, both included. Since you know that:
sumRange(1, i) = i * (i + 1) / 2
you have:
idx <= i * (i+1) / 2
=> 2 * idx <= i * (i+1)
=> 2 * idx <= i² + i + 1/4 - 1/4
=> 2 * idx + 1/4 <= (i + 1/2)²
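=> sqrt(2 * idx + 1/4) - 1/2 <= i
Since i is the smallest integer that satisfies this inequality, i = ceil(sqrt(2 * idx + 1/4) - 1/2), and j = idx - sumRange(1, i-1) = idx - (i-1) * i / 2.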
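As a quick sanity check of the derivation, the following C++ sketch compares the loop version with the closed form over a range of indexes. The `Cell` struct and the function names are hypothetical, introduced only for this check; both functions return one-based (i, j) as in the snippets above.

```cpp
#include <cassert>
#include <cmath>

// One-based (i, j) pair; a hypothetical struct for this sketch.
struct Cell {
    int i;
    int j;
};

// Reference: subtract row lengths 1, 2, 3, ... until idx fits in row i.
Cell cell_by_loop(int idx) {
    assert(idx >= 1);  // idx is one-based
    int j = idx;
    int i = 1;
    while (j > i) {
        j -= i++;
    }
    return {i, j};
}

// Closed form: i = ceil(sqrt(2 * idx + 1/4) - 1/2), j = idx - sumRange(1, i-1).
Cell cell_by_formula(int idx) {
    assert(idx >= 1);
    int i = static_cast<int>(std::ceil(std::sqrt(2.0 * idx + 0.25) - 0.5));
    int j = idx - (i - 1) * i / 2;
    return {i, j};
}

// Returns true if both methods agree for every idx in [1, limit].
bool methods_agree(int limit) {
    for (int idx = 1; idx <= limit; ++idx) {
        Cell a = cell_by_loop(idx);
        Cell b = cell_by_formula(idx);
        if (a.i != b.i || a.j != b.j) return false;
    }
    return true;
}
```

For example, idx = 4 falls in row 3 at column 1 with either method, and idx = 6 is the last element of row 3.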