C - fastest method to swap two memory blocks of equal size?


What is the fastest way to swap two non-overlapping memory areas of equal size? Say, I need to swap (t_Some *a) with (t_Some *b). Considering space-tim

9 Answers

眼角桃花

2021-02-20 06:12
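The fastest way to move a block of memory is going to be memcpy() from <string.h>. If you memcpy() from a to temp, memcpy() from b to a (the blocks do not overlap, so memmove() is not required, though it also works), then memcpy() from temp to b, you’ll have a swap that uses the optimized library routines, which the compiler probably inlines. For a large block you wouldn’t want to copy everything at once, since that needs a temporary as big as the block itself; swap it in smaller pieces instead, such as vector-sized chunks.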
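The chunked three-copy swap described above could look roughly like the following. This is a minimal sketch, not a tuned implementation: the name `swap_blocks_chunked` and the chunk size `SWAP_CHUNK` are made up for illustration, and it assumes C99 and that the two blocks never overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk size; the best value depends on the target. */
#define SWAP_CHUNK ((size_t)256)

/* Swap two non-overlapping blocks of n bytes through a small stack buffer. */
void swap_blocks_chunked(void *restrict a, void *restrict b, size_t n)
{
    unsigned char tmp[SWAP_CHUNK];
    unsigned char *p = a;
    unsigned char *q = b;

    assert(n == 0 || (a != NULL && b != NULL));

    while (n > 0) {
        const size_t len = n < SWAP_CHUNK ? n : SWAP_CHUNK;

        memcpy(tmp, p, len);  /* a -> temp */
        memcpy(p, q, len);    /* b -> a (safe: the blocks do not overlap) */
        memcpy(q, tmp, len);  /* temp -> b */

        p += len;
        q += len;
        n -= len;
    }
}
```

The temporary stays a fixed size no matter how large the blocks are, and the last iteration handles any remainder shorter than one chunk.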

In practice, if you write a tight loop, the compiler can probably tell that you’re swapping every element of the arrays and optimize accordingly. On most modern CPUs, you want it to generate vector instructions. The compiler might be able to generate faster code if you make sure all three buffers (a, b and the temporary) are aligned.

However, what you really want to do is make things easier for the optimizer. Take this program:
```
#include <stddef.h>

void swap_blocks_with_loop( void* const a, void* const b, const size_t n )
{
  unsigned char* p;
  unsigned char* q;
  unsigned char* const sentry = (unsigned char*)a + n;

  for ( p = a, q = b; p < sentry; ++p, ++q ) {
     const unsigned char t = *p;
     *p = *q;
     *q = t;
  }
}
```
If you translate that into machine code as literally written, it’s a terrible algorithm, copying one byte at a time, doing two increments per iteration, and so on. In practice, though, the compiler sees what you’re really trying to do.

In clang 5.0.1 with -std=c11 -O3, it produces (in part) the following inner loop on x86_64:
```
.LBB0_7:                                # =>This Inner Loop Header: Depth=1
        movups  (%rcx,%rax), %xmm0
        movups  16(%rcx,%rax), %xmm1
        movups  (%rdx,%rax), %xmm2
        movups  16(%rdx,%rax), %xmm3
        movups  %xmm2, (%rcx,%rax)
        movups  %xmm3, 16(%rcx,%rax)
        movups  %xmm0, (%rdx,%rax)
        movups  %xmm1, 16(%rdx,%rax)
        movups  32(%rcx,%rax), %xmm0
        movups  48(%rcx,%rax), %xmm1
        movups  32(%rdx,%rax), %xmm2
        movups  48(%rdx,%rax), %xmm3
        movups  %xmm2, 32(%rcx,%rax)
        movups  %xmm3, 48(%rcx,%rax)
        movups  %xmm0, 32(%rdx,%rax)
        movups  %xmm1, 48(%rdx,%rax)
        addq    $64, %rax
        addq    $2, %rsi
        jne     .LBB0_7
```
gcc 7.2.0 with the same flags also vectorizes, but unrolls the loop less:
```
.L7:
        movdqa  (%rcx,%rax), %xmm0
        addq    $1, %r9
        movdqu  (%rdx,%rax), %xmm1
        movaps  %xmm1, (%rcx,%rax)
        movups  %xmm0, (%rdx,%rax)
        addq    $16, %rax
        cmpq    %r9, %rbx
        ja      .L7
```
Convincing the compiler to produce instructions that work on a single word at a time, instead of vectorizing the loop, is the opposite of what you want!
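One further way to make things easier for the optimizer, assuming C99 or later, is to promise that the blocks do not overlap by qualifying the pointers with `restrict`. The sketch below is a variant of the loop above under that assumption; the name `swap_blocks_restrict` is hypothetical. It may let the compiler skip a runtime overlap check before the vectorized loop, but whether it changes the generated code depends on the compiler and flags, so inspect the assembly before relying on it.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the caller guarantees that [a, a+n) and [b, b+n) do not
   overlap. Passing overlapping blocks is undefined behavior. */
void swap_blocks_restrict(void *restrict a, void *restrict b, size_t n)
{
    unsigned char *restrict p = a;
    unsigned char *restrict q = b;

    assert(n == 0 || (a != NULL && b != NULL));

    /* A plain byte-wise loop; the vectorizer is expected to widen it. */
    for (size_t i = 0; i < n; ++i) {
        const unsigned char t = p[i];
        p[i] = q[i];
        q[i] = t;
    }
}
```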
隐瞒了意图╮

2021-02-20 06:17
Thought I'd share the simple solution I've been using for ages on microcontrollers without drama.
```
#define swap(type, x, y) { type _tmp = (x); (x) = (y); (y) = _tmp; }
```
OK... it creates a stack variable but it's usually for uint8_t, uint32_t, float, double, etc. However it should work on structures just as well.

The compiler should be smart enough to see the stack variable can be swapped for a register when the size of the type permits.

Really only meant for small types... which will probably suit 99% of cases.

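Could also use "auto" instead of passing the type, but only where "auto" deduces a type (C23, or C++11 and later) and only if the macro initializes the temporary in its declaration; in older C, "auto" is just a storage-class specifier. I like to be more flexible, and under those conditions "auto" can simply be passed as the type.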

examples...
```
swap(uint8_t, var1, var2);
swap(float, fv1, fv2);
swap(uint32_t, *p1, *p2); // swaps the pointed-to values, since p1 and p2 are pointers
swap(auto, var1, var2);   // needs type-deducing auto (C23, C++11) and an initialized _tmp; both variables must have the same type
```
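One caveat with a brace-only macro body like swap() above: written as `swap(...);` directly before an `else`, the extra semicolon ends the `if` statement and the `else` no longer compiles. A common idiom is to wrap the body in `do { ... } while (0)` so the macro behaves like a single statement. The sketch below uses hypothetical names (`SWAP_STMT`, `order_pair`) and is only one way to write it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same idea as swap() above, wrapped so it acts as exactly one statement. */
#define SWAP_STMT(type, x, y) \
    do { type swap_tmp_ = (x); (x) = (y); (y) = swap_tmp_; } while (0)

/* Order the pair so that *lo <= *hi. Returns 1 if a swap happened, else 0. */
int order_pair(uint32_t *lo, uint32_t *hi)
{
    assert(lo != NULL && hi != NULL);

    if (*lo > *hi)
        SWAP_STMT(uint32_t, *lo, *hi);  /* the trailing ';' is fine before 'else' */
    else
        return 0;

    return 1;
}
```

The arguments are still evaluated more than once, so expressions with side effects such as `*p++` should not be passed to either macro.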

清歌不尽

2021-02-20 06:22

Word writes will be the fastest. However, both block size and alignment need to be considered. In practice things are usually aligned sensibly, but you shouldn't count on it. memcpy() safely handles everything and may be specialized (built-in) for constant sizes within reason.

Here is a portable solution that works reasonably well in most cases.

```
#include <stddef.h>
#include <string.h>

/* Swap count bytes, one byte at a time. */
static void swap_byte(void* a, void* b, size_t count)
{
    char* x = (char*) a;
    char* y = (char*) b;

    while (count--) {
        char t = *x; *x = *y; *y = t;
        x += 1;
        y += 1;
    }
}

/* Swap count words of sizeof(long) bytes each.
   Going through memcpy() keeps unaligned blocks safe. */
static void swap_word(void* a, void* b, size_t count)
{
    char* x = (char*) a;
    char* y = (char*) b;
    long t[1];

    while (count--) {
        memcpy(t, x, sizeof(long));
        memcpy(x, y, sizeof(long));
        memcpy(y, t, sizeof(long));
        x += sizeof(long);
        y += sizeof(long);
    }
}

/* Swap two non-overlapping blocks: whole words first, then the leftover bytes. */
void memswap(void* a, void* b, size_t size)
{
    size_t words = size / sizeof(long);
    size_t bytes = size % sizeof(long);
    swap_word(a, b, words);
    a = (char*) a + words * sizeof(long);
    b = (char*) b + words * sizeof(long);
    swap_byte(a, b, bytes);
}
```
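On the remark that memcpy() may be specialized for constant sizes: when the block is a single object of known type, the size is a compile-time constant and each copy has a fixed length. A minimal sketch, assuming a small made-up struct standing in for the question's `t_Some` (the fields and the name `swap_some` are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout; the question does not show the real t_Some. */
typedef struct {
    double position[3];
    int    id;
} t_Some;

/* Swap two objects. sizeof tmp is a constant, so the compiler is free
   to expand each memcpy() inline instead of calling the library. */
void swap_some(t_Some *a, t_Some *b)
{
    t_Some tmp;

    assert(a != NULL && b != NULL);
    if (a == b)
        return;  /* memcpy() must not be given overlapping objects */

    memcpy(&tmp, a, sizeof tmp);
    memcpy(a, b, sizeof tmp);
    memcpy(b, &tmp, sizeof tmp);
}
```

Plain struct assignment (`tmp = *a; *a = *b; *b = tmp;`) expresses the same swap and is usually compiled much the same way.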
