I am implementing several data structures, and one primitive I want to use is the following: I have a memory chunk A[N] (it has variable length, but I use 100 in my examples).
A detailed explanation of this case is given in the first answer, by R..; I have nothing to add here.
The easiest approach is to always rotate the whole array. This also moves some elements that did not need to move out of the destination range, but since K > N/2 in this case, it performs no more than twice as many operations as necessary.
To rotate the array, use the cycle leader algorithm: take the first element of the array (A[0]) and copy it to its destination position; the previous contents of that position are moved in turn to their own proper position; continue until some element is moved to the starting position.
Continue applying the cycle leader algorithm for the next starting positions: A[1], A[2], ..., A[GCD(N,d) - 1], where d is the distance between source and destination.
After GCD(N,d) steps, all elements are in their proper positions. This works because:

- Positions 0, 1, ..., GCD(N,d) - 1 belong to different cycles, because every step along a cycle adds d modulo N and so preserves the position's remainder modulo GCD(N,d).
- Each cycle has length N / GCD(N,d), because d / GCD(N,d) and N / GCD(N,d) are relatively prime.

This algorithm is simple and it moves each element exactly once. It can be made thread-safe (if we skip the write step unless we are inside the destination range). Another multithreading-related advantage is that each element can only ever hold one of two values: its value before the move and its value after the move (no temporary in-between values are possible).
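A minimal sketch of the cycle leader rotation described above, assuming an `int` array that is rotated as a whole by d positions toward higher indices. The names `rotate_cycle_leader` and `gcd_size` are illustrative only and do not come from the original discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor by Euclid's algorithm. */
size_t gcd_size(size_t a, size_t b)
{
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Rotate a[0..n) so that the element at index i ends up at (i + d) % n.
 * Each element is written exactly once. */
void rotate_cycle_leader(int *a, size_t n, size_t d)
{
    if (n == 0)
        return;
    d %= n;
    if (d == 0)
        return;

    size_t cycles = gcd_size(n, d);
    for (size_t start = 0; start < cycles; start++) {
        int carried = a[start];          /* element looking for its new home */
        size_t pos = start;
        do {
            /* pos = (pos + d) % n, written so that it cannot overflow */
            pos = (pos >= n - d) ? pos - (n - d) : pos + d;
            int displaced = a[pos];
            a[pos] = carried;
            carried = displaced;
        } while (pos != start);          /* the cycle closes at its leader */
    }
}
```

With N = 6 and d = 2 there are GCD(6,2) = 2 cycles of length 3 each: 0 -> 2 -> 4 -> 0 and 1 -> 3 -> 5 -> 1.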
But it does not always have optimal performance. If element_size * GCD(N,d) is comparable to the cache line size, we can take all GCD(N,d) starting positions and process them together. If this value is too large, we can split the starting positions into several contiguous segments to bring the space requirement back down to O(1).
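One possible reading of "process them together" is to treat each run of GCD(N,d) consecutive elements as a single block and run one cycle over blocks with memcpy. The sketch below assumes that reading. The names `rotate_blocks` and `gcd_size` are hypothetical, and it allocates a scratch buffer of 2 * GCD(N,d) elements instead of splitting the starting positions into segments.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Greatest common divisor by Euclid's algorithm. */
size_t gcd_size(size_t a, size_t b)
{
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Rotate a[0..n) so that element i ends up at (i + d) % n, moving
 * blocks of g = gcd(n, d) contiguous elements at a time.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int rotate_blocks(int *a, size_t n, size_t d)
{
    if (n == 0)
        return 0;
    d %= n;
    if (d == 0)
        return 0;

    size_t g = gcd_size(n, d);
    int *buf = malloc(2 * g * sizeof *buf);
    if (buf == NULL)
        return -1;
    int *carried = buf;        /* block looking for its new home */
    int *displaced = buf + g;  /* block evicted from that home */

    /* Both n and d are multiples of g, so every visited position is a
     * multiple of g and no block ever wraps around the end of the array. */
    memcpy(carried, a, g * sizeof *a);
    size_t pos = 0;
    do {
        pos = (pos >= n - d) ? pos - (n - d) : pos + d;  /* (pos + d) % n */
        memcpy(displaced, a + pos, g * sizeof *a);
        memcpy(a + pos, carried, g * sizeof *a);
        int *t = carried;
        carried = displaced;
        displaced = t;
    } while (pos != 0);

    free(buf);
    return 0;
}
```

The N / GCD(N,d) blocks form a single cycle, for the same reason that each element-wise cycle has that length.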
The problem arises when element_size * GCD(N,d) is much smaller than the cache line size. In this case we get a lot of cache misses and performance degrades. gusbro's idea of temporarily swapping array pieces with some "swap" region (of size d) suggests a more efficient algorithm for this case. It can be optimized further if we use a "swap" region that fits in the cache and copy the non-overlapping areas with memcpy.
One more algorithm: it does not touch elements outside the source and destination ranges, and it is cache-friendly. Its only disadvantage is that it moves each source element up to twice.
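The idea is to move two pointers in opposite directions and swap the elements they point to. The pointers start at the outer ends of the region covered by the source and destination ranges and stop once they meet, so overlapping regions are not a problem: the overlap is simply reversed. After the first pass, all source elements have been moved to the destination range, but in reversed order, so the second pass reverses the destination range:

    /* Indices are in [0, N); the ranges hold K elements and overlap at one end at most. */
    shift = (dst_start + N - src_start) % N;     /* forward distance from source to destination */
    lo = shift < K ? src_start : dst_start;      /* outer ends of the two ranges */
    hi = ((shift < K ? dst_end : src_end) + N - 1) % N;
    for (i = 0; i < K && lo != hi; i++) {        /* pass 1: stop once the pointers meet */
        swap(lo, hi);
        if ((lo + 1) % N == hi) break;           /* adjacent pair swapped; next step would cross */
        lo = (lo + 1) % N; hi = (hi + N - 1) % N;
    }
    lo = dst_start; hi = (dst_end + N - 1) % N;  /* pass 2: reverse the destination range */
    for (i = 0; i < K / 2; i++) {
        swap(lo, hi);
        lo = (lo + 1) % N; hi = (hi + N - 1) % N;
    }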
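For reference, here is a self-contained sketch of this two-pass idea in C with the stopping conditions spelled out. The function name `circular_move` and its parameters are hypothetical. It assumes an `int` array, indices already reduced modulo n, and source and destination ranges that overlap at one end at most (the case where they overlap at both ends is better served by rotating the whole array).

```c
#include <assert.h>
#include <stddef.h>

/* Swap two array slots. */
void swap_int(int *x, int *y)
{
    int t = *x;
    *x = *y;
    *y = t;
}

/* Move k elements starting at src_start to the positions starting at
 * dst_start inside the circular array a[0..n). Elements outside both
 * ranges are left untouched; displaced destination values end up in
 * the part of the source range that is not covered by the destination. */
void circular_move(int *a, size_t n, size_t src_start, size_t dst_start, size_t k)
{
    if (n == 0 || k == 0)
        return;
    assert(k <= n && src_start < n && dst_start < n);

    size_t shift = (dst_start + n - src_start) % n;  /* forward distance */
    size_t src_last = (src_start + k - 1) % n;
    size_t dst_last = (dst_start + k - 1) % n;

    /* Pass 1: start at the outer ends of the two ranges and swap inward. */
    size_t lo = (shift < k) ? src_start : dst_start;
    size_t hi = (shift < k) ? dst_last : src_last;
    for (size_t i = 0; i < k && lo != hi; i++) {
        swap_int(&a[lo], &a[hi]);
        if ((lo + 1) % n == hi)
            break;                       /* pointers would cross next */
        lo = (lo + 1) % n;
        hi = (hi + n - 1) % n;
    }

    /* Pass 2: the destination now holds the source reversed; undo that. */
    lo = dst_start;
    hi = dst_last;
    for (size_t i = 0; i < k / 2; i++) {
        swap_int(&a[lo], &a[hi]);
        lo = (lo + 1) % n;
        hi = (hi + n - 1) % n;
    }
}
```

For example, with N = 10, A[i] = i, and 5 elements moved from index 0 to index 2, the first pass leaves 6 5 4 3 2 1 0 7 8 9 and the second pass leaves 6 5 0 1 2 3 4 7 8 9.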