Given an array of size 3n of the form
[x1, x2, x3... xn, y1, y2, y3... yn, z1, z2, z3... zn]
Convert it to [x1, y1, z1, x2, y2, z2, ... xn, yn, zn]
Since David does not seem interested in writing it down (well obviously he is interested, see the other answer :), I will use his reference to arrive at an algorithm for the case with 3 partitions.
First note that if we can solve the problem efficiently for some m < n using an algorithm A, we can rearrange the array so that we can apply A and are then left with a smaller subproblem. Say the original array is
x1 .. xm x{m+1}.. xn y1 .. ym y{m+1} .. yn z1 .. zm z{m+1} .. zn
We want to rearrange it to
x1 .. xm y1 .. ym z1 .. zm x{m+1} .. xn y{m+1} .. yn z{m+1} .. zn
This is basically a transformation of the pattern AaBbCc
to ABCabc
where A, B, C each have length m and a, b, c each have length n - m. We can achieve that through a series of reversals. Let X' denote the reversal of string X here:
AaBbCc
-> Aa(BbCc)' = Aac'C'b'B'
-> Aac'(C'b')'B' = Aac'bCB'
-> A(ac'bCB')' = ABC'b'ca'
-> ABCb'ca'
-> ABC(b'ca')' = ABCac'b
-> ABCa(c'b)' = ABCab'c
-> ABCabc
There's probably a shorter way, but this is still just a constant number of reversals, each of which takes linear time, so the whole rearrangement takes only linear time. One could use a more sophisticated algorithm here to implement some of the cyclic shifts, but that's just an optimization.
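The reversal sequence above can be sketched in Python as follows. This is only an illustration under a few assumptions: the array is a plain list of length 3n with 0-based indices, and the helper names `reverse_range` and `regroup` are made up for this example.

```python
def reverse_range(a, lo, hi):
    """Reverse the slice a[lo:hi] in place."""
    hi -= 1
    while lo < hi:
        a[lo], a[hi] = a[hi], a[lo]
        lo += 1
        hi -= 1


def regroup(a, n, m):
    """Turn AaBbCc into ABCabc in place, using the seven reversals above.

    A, B, C have length m; a, b, c have length r = n - m; len(a) == 3 * n.
    """
    r = n - m
    reverse_range(a, n, 3 * n)                  # Aa(BbCc)'   -> Aac'C'b'B'
    reverse_range(a, n + r, 2 * n + r)          # (C'b')'     -> Aac'bCB'
    reverse_range(a, m, 3 * n)                  # (ac'bCB')'  -> ABC'b'ca'
    reverse_range(a, 2 * m, 3 * m)              # (C')'       -> ABCb'ca'
    reverse_range(a, 3 * m, 3 * n)              # (b'ca')'    -> ABCac'b
    reverse_range(a, 3 * m + r, 3 * n)          # (c'b)'      -> ABCab'c
    reverse_range(a, 3 * m + r, 3 * m + 2 * r)  # (b')'       -> ABCabc
```

Each call touches at most 3n elements, so the whole step is linear and uses only a constant amount of extra space.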
Now we can solve the two partitions of our array recursively and we're done.
The question remains, what would be a nice m that allows us to solve the left part easily?
To figure this out, we need to realize that what we want to implement is a particular permutation P of the array indices. Every permutation can be decomposed into a set of cycles a0 -> a1 -> ... -> a{k-1} -> a0, for which we have P(ai) = a{(i + 1) % k}. It is easy to process such a cycle in place; the algorithm is outlined on Wikipedia.
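Processing a single cycle in place can look roughly like this. It is a generic sketch, not tied to the triple-interleaving permutation; the name `move_cycle` and the callback `new_pos` (which maps an index to the index its element should move to) are assumptions made for this example.

```python
def move_cycle(a, leader, new_pos):
    """Move every element on the cycle through `leader` to its new position.

    new_pos(i) is the index that the element currently at index i must end up at.
    Only one element is held outside the array at any time.
    """
    carried = a[leader]
    i = new_pos(leader)
    while i != leader:
        # Drop the carried element into its slot and pick up the one it displaces.
        a[i], carried = carried, a[i]
        i = new_pos(i)
    a[leader] = carried
```

For example, with `a = list("abcde")` and `new_pos = lambda i: (i + 2) % 5`, which is a single cycle, `move_cycle(a, 0, new_pos)` leaves `a` equal to `['d', 'e', 'a', 'b', 'c']`.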
Now the problem is, after you have finished processing one of the cycles, how to find an element that is part of a cycle you have not yet processed. There is no generic solution for this, but for some particular permutations there are nice formulas that describe exactly which positions belong to the different cycles.
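For your problem, you just choose m = (5^(2k) - 1)/3 with k as large as possible such that m <= n (arrays too small for any k >= 1 can be handled directly). This form of m corresponds to 1-based positions with the map i -> 3i mod (3m + 1), where 3m + 1 = 5^(2k); one element from each of the different cycles is then given by 5^0, 5^1, ..., 5^{2k-1}. Note that this map emits each triple as z, y, x. For the exact order in the question, number the positions from 0: position i then moves to 3i mod (3m - 1), the convenient sizes are m = (5^(2k+1) + 1)/3, and the leaders are 5^0, 5^1, ..., 5^{2k}. Either way, you can use the leaders to implement the cycle-leader algorithm on the left part of the array (after the shifting) in O(m).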
We solve the leftover right part recursively and get an algorithm to solve the problem in time
T(n) = O(m) + T(n - m)
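and since m = Omega(n) (the candidate values of m grow by a factor of about 25 from one k to the next, so the largest one not exceeding n is at least roughly n/25), the remaining problem shrinks by a constant factor in each step and we get T(n) = O(n).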
This answer is based on work by Peiyush Jain (whose bibliography is woefully incomplete, but I don't feel like taking the time to straighten out the history of the in-place transposition problem). Observe that 3 is a primitive root of 25 = 5^2, since
>>> len(set(pow(3, n, 25) for n in range(25)))
20
and 20 is Euler's totient of 25. By Jain's Theorem 1, a classic result in number theory, 3 is a primitive root for all 5^k.
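When the array has length 3n, the new position of the element at position k*n + j (with 0 <= k < 3 and 0 <= j < n) is 3*j + k. In general, the new position of the element at position i (except for the last element, which stays put) is (3*i) % (3*n - 1); conversely, the element that ends up at position i comes from position (i*n) % (3*n - 1). Note that n is the multiplicative inverse of 3 modulo 3*n - 1, so 3 is a primitive root if and only if n is.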
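A brute-force check of this index arithmetic helps pin down the direction of the map. The sketch below assumes 0-based positions; `check_index_map` is a name invented for this example. It confirms that the element at k*n + j lands at 3*j + k, that this equals (3*i) % (3*n - 1) for every position except the last, and that multiplying by n modulo 3*n - 1 takes the new position back to the old one.

```python
def check_index_map(n):
    """Verify the closed-form index map for an array of length 3 * n."""
    mod = 3 * n - 1
    for k in range(3):
        for j in range(n):
            i = k * n + j        # old position of the j-th element of block k
            new = 3 * j + k      # its position after interleaving
            if i == mod:
                # The last element is a fixed point and is excluded from the formula.
                assert new == i
                continue
            assert new == (3 * i) % mod   # forward map: multiply by 3
            assert i == (new * n) % mod   # inverse map: multiply by n
    return True
```

Running `all(check_index_map(n) for n in range(1, 50))` evaluates to `True`.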
Jain's observation, in this case, is that, if 3*n - 1 is a power of 5, then the permutation above has log_5 (3*n - 1) cycles of length greater than 1, led by 5^e for e from 0 to log_5 (3*n - 1) - 1; the first and last positions are fixed points. (This is more or less the definition of primitive root.) For each cycle, all we have to do is move the leader, move the element displaced by the leader, move the element displaced by the element displaced by the leader, etc., until we return to the leader.
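For the special sizes, the cycle-leader pass might be sketched as below. This assumes 0-based positions, an array length 3*n with 3*n - 1 an exact power of 5 (6, 126, 3126, ...), and the forward step i -> (3*i) % (3*n - 1); the leaders are the powers of 5 strictly below 3*n - 1, and the first and last positions are left alone. The function name `interleave3_pow5` is made up for this example.

```python
def interleave3_pow5(a):
    """Interleave [x..., y..., z...] in place when len(a) - 1 is a power of 5."""
    mod = len(a) - 1
    power = 1
    while power < mod:
        power *= 5
    if len(a) % 3 != 0 or power != mod:
        raise ValueError("length must be a multiple of 3 and one more than a power of 5")

    leader = 1
    while leader < mod:
        # Walk the cycle through this leader, carrying one element at a time.
        carried = a[leader]
        i = (3 * leader) % mod
        while i != leader:
            a[i], carried = carried, a[i]
            i = (3 * i) % mod
        a[leader] = carried
        leader *= 5
```

Every position from 1 to 3*n - 2 lies on exactly one of these cycles, so each element is moved once and the pass is linear.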
For other array sizes, break the array into O(log n) implicit subarrays whose lengths are 3 or one plus an odd power of 5 (the odd powers are the ones for which the sum is divisible by 3): 6, 126, 3126, 78126, etc. Do a series of rotations, decreasing geometrically in size, to get the subarrays contiguous, then run the above algorithm on each of them.
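One way to pick the subarray lengths is a greedy split, sketched below. The greedy choice is an assumption of this example rather than something the description above prescribes, and `block_lengths` is a hypothetical helper name; the rotations that make each subarray contiguous are not shown.

```python
def block_lengths(total):
    """Split a length divisible by 3 into lengths from {3, 6, 126, 3126, ...}."""
    if total % 3 != 0:
        raise ValueError("total must be a multiple of 3")
    candidates = [3]
    size = 6                          # 5**1 + 1
    while size <= total:
        candidates.append(size)
        size = (size - 1) * 25 + 1    # next odd power of 5, plus one
    lengths = []
    for c in reversed(candidates):    # always take the largest length that still fits
        while total >= c:
            lengths.append(c)
            total -= c
    return lengths
```

For example, `block_lengths(150)` returns `[126, 6, 6, 6, 6]` and `block_lengths(9)` returns `[6, 3]`. Because consecutive candidate lengths differ by a bounded factor, the number of blocks stays logarithmic in the total length.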
If you actually implement this, please benchmark it. I did for the base case of Jain's algorithm (3^n - 1, pairs instead of triples) and found that, on my machine, the O(n log n)-time algorithm was faster for non-galactic input sizes. YMMV of course.