I\'ve read the article on Wikipedia on the Duff\'s device, and I don\'t get it. I am really interested, but I\'ve read the explanation there a couple of times and I still do
There are two key things to Duff's device. First, which I suspect is the easier part to understand, the loop is unrolled. This trades larger code size for more speed by avoiding some of the overhead involved in checking whether the loop is finished and jumping back to the top of the loop. The CPU can run faster when it's executing straight-line code instead of jumping.
The second aspect is the switch statement. It allows the code to jump into the middle of the loop the first time through. The surprising part to most people is that such a thing is allowed. Well, it's allowed. Execution starts at the calculated case label, and then it falls through to each successive assignment statement, just like any other switch statement. After the last case label, execution reaches the bottom of the loop, at which point it jumps back to the top. The top of the loop is inside the switch statement, so the switch is not re-evaluated anymore.
The original loop is unwound eight times, so the number of iterations is divided by eight. If the number of bytes to be copied isn't a multiple of eight, then there are some bytes left over. Most algorithms that copy blocks of bytes at a time will handle the remainder bytes at the end, but Duff's device handles them at the beginning. The function calculates count % 8
for the switch statement to figure what the remainder will be, jumps to the case label for that many bytes, and copies them. Then the loop continues to copy groups of eight bytes.
Here's a non-detailed explanation which is what I feel to be the crux of Duff's device:
The thing is, C is basically a nice facade for assembly language (PDP-7 assembly to be specific; if you studied that you would see how striking the similarities are). And, in assembly language, you don't really have loops - you have labels and conditional-branch instructions. So the loop is just a part of the overall sequence of instructions with a label and a branch somewhere:
instruction
label1: instruction
instruction
instruction
instruction
jump to label1 some condition
and a switch instruction is branching/jumping ahead somewhat:
evaluate expression into register r
compare r with first case value
branch to first case label if equal
compare r with second case value
branch to second case label if equal
etc....
first_case_label:
instruction
instruction
second_case_label:
instruction
instruction
etc...
In assembly it's easily conceivable how to combine these two control structures, and when you think of it that way, their combination in C doesn't seem so weird anymore.
Here is a working example for 64-bit memcpy with Duff's Device:
#include <iostream>
#include <memory>
inline void __memcpy(void* to, const void* from, size_t count)
{
size_t numIter = (count + 56) / 64; // gives the number of iterations; bit shift actually, not division
size_t rest = count & 63; // % 64
size_t rest7 = rest&7;
rest -= rest7;
// Duff's device with zero case handled:
switch (rest)
{
case 0: if (count < 8)
break;
do { *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
case 56: *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
case 48: *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
case 40: *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
case 32: *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
case 24: *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
case 16: *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
case 8: *(((unsigned long long*&)to)++) = *(((unsigned long long*&)from)++);
} while (--numIter > 0);
}
switch (rest7)
{
case 7: *(((unsigned char*)to)+6) = *(((unsigned char*)from)+6);
case 6: *(((unsigned short*)to)+2) = *(((unsigned short*)from)+2); goto case4;
case 5: *(((unsigned char*)to)+4) = *(((unsigned char*)from)+4);
case 4: case4: *((unsigned long*)to) = *((unsigned long*)from); break;
case 3: *(((unsigned char*)to)+2) = *(((unsigned char*)from)+2);
case 2: *((unsigned short*)to) = *((unsigned short*)from); break;
case 1: *((unsigned char*)to) = *((unsigned char*)from);
}
}
void main()
{
static const size_t NUM = 1024;
std::unique_ptr<char[]> str1(new char[NUM+1]);
std::unique_ptr<char[]> str2(new char[NUM+1]);
for (size_t i = 0 ; i < NUM ; ++ i)
{
size_t idx = (i % 62);
if (idx < 26)
str1[i] = 'a' + idx;
else
if (idx < 52)
str1[i] = 'A' + idx - 26;
else
str1[i] = '0' + idx - 52;
}
for (size_t i = 0 ; i < NUM ; ++ i)
{
memset(str2.get(), ' ', NUM);
__memcpy(str2.get(), str1.get(), i);
if (memcmp(str1.get(), str2.get(), i) || str2[i] != ' ')
{
std::cout << "Test failed for i=" << i;
}
}
return;
}
It handles zero-length case (in original Duff's Device there is assumption num>0). Function main() contains simple test cases for __memcpy.
The point of duffs device is to reduce the number of comparisons done in a tight memcpy implementation.
Suppose you want to copy 'count' bytes from a to b, the straight forward approach is to do the following:
do {
*a = *b++;
} while (--count > 0);
How many times do you need to compare count to see if it's a above 0? 'count' times.
Now, the duff device uses a nasty unintentional side effect of a switch case which allows you to reduce the number of comparisons needed to count / 8.
Now suppose you want to copy 20 bytes using duffs device, how many comparisons would you need? Only 3, since you copy eight bytes at a time except the last first one where you copy just 4.
UPDATED: You don't have to do 8 comparisons/case-in-switch statements, but it's reasonable a trade-off between function size and speed.
This is an answer I posted to another question about Duff's Device that got some upvaotes before the question was closed as a duplicate. I think it provides a bit of valuable context here on why you should avoid this construct.
"This is Duff's Device. It's a method of unrolling loops that avoids having to add a secondary fix-up loop to deal with times when the number of loop iteration isn't know to be an exact multiple of the unrolling factor.
Since most answers here seem to be generally positive about it I'm going to highlight the downsides.
With this code a compiler is going to struggle to apply any optimization to the loop body. If you just wrote the code as a simple loop a modern compiler should be able to handle the unrolling for you. This way you maintain readability and performance and have some hope of other optimizations being applied to the loop body.
The Wikipedia article referenced by others even says when this 'pattern' was removed from the Xfree86 source code performance actually improved.
This outcome is typical of blindly hand optimizing any code you happen to think might need it. It prevents the compiler from doing its job properly, makes your code less readable and more prone to bugs and typically slows it down. If you were doing things the right way in the first place, i.e. writing simple code, then profiling for bottlenecks, then optimizing, you'd never even think to use something like this. Not with a modern CPU and compiler anyway.
It's fine to understand it, but I'd be surprised if you ever actually use it."