C move memory parts inplace

轮回少年 2021-02-07 09:52

I am implementing several data structures, and one primitive I want to use is the following: I have a memory chunk A[N] (its length is variable, but I use 100 in my examples)

6 answers
  •  粉色の甜心
    2021-02-07 10:24

    Case 1: The source overlaps the destination in at most a single contiguous region, which is smaller than the whole array

    A detailed explanation of this case is given in the first answer, by R.., so I have nothing to add here.

    Case 2: Either the source overlaps the destination in two contiguous regions, or we rotate the whole array

    The easiest approach is to always rotate the whole array. This also moves some elements that are outside the destination range, but since K > N/2 in this case (K being the number of elements to move), it does no more than twice the necessary number of operations.

    To rotate the array, use the cycle leader algorithm: take the first element of the array (A[0]) and copy it to its destination position; the previous contents of that position move in turn to their proper position; continue until some element is moved to the starting position.

    Repeat the cycle leader algorithm for the next starting positions: A[1], A[2], ..., A[GCD(N,d) - 1], where d is the distance between source and destination.

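    After GCD(N,d) passes, every element is in its proper position. This works because:

    1. Positions 0, 1, ..., GCD(N,d) - 1 belong to different cycles: each step of a cycle changes the position by d modulo N, which leaves its residue modulo GCD(N,d) unchanged, and these starting positions all have different residues.
    2. Each cycle has length N / GCD(N,d), because d / GCD(N,d) and N / GCD(N,d) are relatively prime, so the GCD(N,d) cycles together cover all N positions.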
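
The cycle leader rotation described above can be sketched roughly as follows. This is only a minimal illustration under some assumptions: the elements are `int`, the whole array is rotated so that the element at index i ends up at index (i + d) % n, and the names `gcd` and `rotate_cycle_leader` are made up for this example.

```c
#include <stddef.h>

/* Greatest common divisor by Euclid's algorithm (helper for this sketch). */
static size_t gcd(size_t a, size_t b)
{
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Rotate a[0..n) so that the element at index i ends up at (i + d) % n.
   Hypothetical name; element type int is an assumption of this sketch. */
void rotate_cycle_leader(int *a, size_t n, size_t d)
{
    if (n == 0 || (d %= n) == 0)
        return;                          /* nothing to move */

    size_t cycles = gcd(n, d);           /* number of independent cycles */
    for (size_t start = 0; start < cycles; ++start) {
        int carried = a[start];          /* element looking for its new home */
        size_t pos = start;
        do {
            pos = (pos + d) % n;         /* destination of the carried element */
            int displaced = a[pos];      /* remember what was there */
            a[pos] = carried;            /* each slot is written exactly once */
            carried = displaced;
        } while (pos != start);          /* cycle closes at its leader */
    }
}
```

For example, rotating {0, 1, 2, 3, 4, 5} with d = 2 runs GCD(6,2) = 2 cycles of length 3 and gives {4, 5, 0, 1, 2, 3}.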

    This algorithm is simple and moves each element exactly once. It can be made thread-safe (if we skip the write step unless the target is inside the destination range). Another advantage for multi-threading is that each element can only have two values: the value before the "move" and the value after it (no temporary in-between values are possible).

    But it does not always have optimal performance. If element_size * GCD(N,d) is comparable to the cache line size, we can take all GCD(N,d) starting positions and process them together. If this value is too large, we can split the starting positions into several contiguous segments to bring the space requirement back to O(1).

    The problem arises when element_size * GCD(N,d) is much smaller than the cache line size. In this case we get a lot of cache misses and performance degrades. gusbro's idea of temporarily swapping array pieces with a "swap" region (of size d) suggests a more efficient algorithm for this case. It can be optimized further if we use a "swap" region that fits in the cache and copy the non-overlapping areas with memcpy.
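
To make the "swap" region idea a bit more concrete, here is a deliberately simplified sketch. It is not necessarily what gusbro's answer does: it assumes `int` elements, rotates the whole array, and uses a heap-allocated scratch buffer of d elements rather than a cache-sized one. The name `rotate_with_scratch` is invented for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Rotate a[0..n) so that the element at index i ends up at (i + d) % n,
   using a temporary buffer of d elements (an assumption of this sketch).
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int rotate_with_scratch(int *a, size_t n, size_t d)
{
    if (n == 0 || (d %= n) == 0)
        return 0;                                   /* nothing to move */

    int *scratch = malloc(d * sizeof *scratch);
    if (scratch == NULL)
        return -1;

    memcpy(scratch, a + (n - d), d * sizeof *a);    /* save the last d elements */
    memmove(a + d, a, (n - d) * sizeof *a);         /* shift the rest; ranges may overlap */
    memcpy(a, scratch, d * sizeof *a);              /* saved tail becomes the head */

    free(scratch);
    return 0;
}
```

All three copies walk memory sequentially, which is what makes this kind of approach cache-friendly; the price is the extra space for the scratch buffer.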


    One more algorithm. It does not overwrite elements that are not in the destination range. And it is cache-friendly. The only disadvantage is: it moves each element exactly twice.

    The idea is to move two pointers in opposite directions and swap the elements they point to. Overlapping regions are not a problem, because they are simply reversed. After the first pass of this algorithm, all source elements have been moved to the destination range, but in reversed order, so the second pass reverses the destination range:

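    /* Indices are taken modulo N; ranges are half-open, dst_end is already
       reduced modulo N, and fewer than N elements are moved.
       rel(x) stands for (x + N - dst_start) % N, the offset of x from dst_start. */

    /* Pass 1: bring the source into the destination range, reversed. */
    for (d = dst_start, s = (src_end + N - 1) % N;
         d != dst_end;
         d = (d + 1) % N, s = (s + N - 1) % N)
      if (rel(s) > rel(d))  /* swap each pair only once, even where the ranges overlap */
        swap(s, d);

    /* Pass 2: reverse the destination range. */
    for (d = dst_start, s = (dst_end + N - 1) % N;
         d != dst_end;
         d = (d + 1) % N, s = (s + N - 1) % N)
      if (rel(s) > rel(d))  /* stop swapping once the pointers have crossed */
        swap(s, d);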
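
For reference, the two passes can be written out as a self-contained function. This is a sketch under a few assumptions: `int` elements, the range given as a start index plus a length k (with k <= n), and hypothetical names `swap_int` and `move_circular`. It swaps each pair of positions only once in the first pass, so that overlapping regions really end up reversed, and it stops the second pass halfway through the range.

```c
#include <stddef.h>

/* Swap two ints (helper for this sketch). */
static void swap_int(int *x, int *y)
{
    int t = *x;
    *x = *y;
    *y = t;
}

/* Move the k elements starting at index src to the k positions starting at
   index dst, with all indices taken modulo n. Hypothetical name and
   signature; elements outside the destination are permuted, not lost. */
void move_circular(int *a, size_t n, size_t src, size_t dst, size_t k)
{
    if (n == 0 || k == 0 || k > n)
        return;
    src %= n;
    dst %= n;

    /* Pass 1: the destination receives the source in reversed order. */
    size_t d = dst;
    size_t s = (src + k - 1) % n;
    for (size_t t = 0; t < k; ++t) {
        /* Offset of s from dst. If it is below t, this pair of positions
           was already swapped earlier in the pass, so skip it. */
        size_t partner = (s + n - dst) % n;
        if (partner > t)
            swap_int(&a[s], &a[d]);
        d = (d + 1) % n;
        s = (s + n - 1) % n;
    }

    /* Pass 2: reverse the destination range, stopping in the middle. */
    d = dst;
    s = (dst + k - 1) % n;
    for (size_t t = 0; t < k / 2; ++t) {
        swap_int(&a[s], &a[d]);
        d = (d + 1) % n;
        s = (s + n - 1) % n;
    }
}
```

With n = 10 and the array 0..9, calling `move_circular(a, 10, 0, 2, 4)` leaves 0, 1, 2, 3 in positions 2..5; the displaced values 4 and 5 end up in positions 1 and 0.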
    
