Reason why custom loop is faster? Bad compiler? Unsafe custom code? Luck (lucky cache hits)?


失恋的感觉 2021-01-16 15:08

I just started learning assembly and wrote some custom loops for swapping two variables, using C++'s asm{} block with the Digital Mars compiler in C-Free 5.0.

Enabled th

5 answers

有刺的猬 (original poster)

2021-01-16 15:39

It's a bit hard to guess what your compiler may be doing without seeing the assembly language result it creates. With VC++ 10, I get the following results:

time of for-loop(cycles)        155
time of while-loop(cycles)      158
time of custom-loop-1(cycles)   369
time of custom-loop-2(cycles)   314

I didn't look at the output, but my immediate guess would be that the difference between the for and while loops is just noise. Both are obviously quite a bit faster than your hand-written assembly code though.
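One way to check the "just noise" guess without reading the assembly is to repeat each measurement a few times and compare the gap between two loops with the run-to-run spread of a single loop. Here is a minimal sketch of that idea. The helpers time_trials, spread and to_ms are hypothetical names made up for this illustration, not part of the code in the question or in this answer, and the only library facilities assumed are clock() and CLOCKS_PER_SEC.

```cpp
#include <algorithm>
#include <ctime>
#include <vector>

// Hypothetical helper (not from the original post): runs `work` `trials`
// times and returns the elapsed clock() ticks of every run.
template <class Work>
std::vector<std::clock_t> time_trials(Work work, int trials) {
    std::vector<std::clock_t> ticks;
    for (int t = 0; t < trials; ++t) {
        std::clock_t start = std::clock();
        work();
        ticks.push_back(std::clock() - start);
    }
    return ticks;
}

// Difference between the slowest and the fastest run: a rough noise estimate.
// Returns 0 for an empty set of measurements.
inline std::clock_t spread(const std::vector<std::clock_t>& ticks) {
    if (ticks.empty()) return 0;
    auto mm = std::minmax_element(ticks.begin(), ticks.end());
    return *mm.second - *mm.first;
}

// Converts clock() ticks to milliseconds.
inline double to_ms(std::clock_t ticks) {
    return 1000.0 * static_cast<double>(ticks) / CLOCKS_PER_SEC;
}
```

If the gap between two loops (3 ticks between for and while above) is no larger than the spread of either loop over several runs, there is little reason to read anything into it. Note that clock() reports ticks, so to_ms is what turns a figure like 155 into wall-clock-style units.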

Edit: looking at the assembly code, I was right -- the code for the for and the while is identical. It looks like this:

        call    _clock
        mov     ecx, DWORD PTR _a$[ebp]
        cdq
        mov     ebx, edx
        mov     edx, DWORD PTR _b$[ebp]
        mov     edi, eax
        mov     esi, 200000000
$LL2@main:
; Line 28
        dec     esi
; Line 30
        mov     eax, ecx
; Line 31
        mov     ecx, edx
; Line 32
        mov     edx, eax
        jne     SHORT $LL2@main
        mov     DWORD PTR _b$[ebp], edx
        mov     DWORD PTR _a$[ebp], ecx
; Line 35
        call    _clock

While arguably less "clever" than your second loop, modern CPUs tend to do best with simple code. It also just has fewer instructions inside the loop (and doesn't reference memory inside the loop at all). Those aren't the sole measures of efficiency by any means, but with a loop this simple, they're fairly indicative.

Edit 2:

Just for fun, I wrote a new version that adds the triple-XOR swap, as well as one using the CPU's xchg instruction (just because that's how I'd probably write it by hand if I didn't care much about speed, etc.) Though Intel/AMD generally recommend against the more complex instructions, it doesn't seem to cause a problem -- it seems to be coming out at least as fast as anything else:

 time of for-loop(cycles)               156
 time of while-loop(cycles)             160
 time swap between register and cache   284
 time to swap using add/sub:            308
 time to swap using xchg:               155
 time to swap using triple-xor          233

Source:

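// Note: updated source -- it was just too ugly to live. Same results though.
// The header names were stripped from this copy of the post; the ones below
// are the headers the surviving code needs.
#include <time.h>     // clock(), clock_t
#include <iomanip>    // std::left, std::setw
#include <iostream>
#include <sstream>
#include <string>

namespace { 
    int a, b;
    const int loops = 200000000;
}

// Runs the functor `swapper` once and prints the elapsed clock() ticks.
template <class swapper>
struct timer {
    timer(std::string const &label) { 
        clock_t t1 = clock();
        swapper()();
        clock_t t2 = clock();
        std::ostringstream buffer;
        buffer << "Time for swap using " << label;
        std::cout << std::left << std::setw(30) << buffer.str() << " = " << (t2-t1) << "\n";
    }
};

struct for_loop {
    void operator()() {
        int temp;
        for(int i=0;i<loops;i++) {
            // Loop body reconstructed from the compiler output shown earlier
            // (source lines 30-32): a plain swap through a temporary.
            temp = a;
            a = b;
            b = temp;
        }
    }
};

// [The remaining swapper structs (while loop, reg<->mem, add/sub, xchg,
//  triple xor) are missing from this copy of the post.]

int main() {
    timer<for_loop>("for loop");
    // The template arguments of the calls below were lost with the structs:
    // timer<...>("while loop");
    // timer<...>("reg<->mem");
    // timer<...>("add/sub");
    // timer<...>("xchg");
    // timer<...>("triple xor");
    return 0;
}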
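For readers who haven't met the triple-XOR and add/sub swaps mentioned in Edit 2, here is a rough portable C++ sketch of the three techniques. This is not the inline assembly that was timed: the function names are made up for illustration, and unsigned integers are an assumption chosen so that the add/sub version wraps around instead of overflowing.

```cpp
#include <cstdint>

// Plain swap through a temporary, the same idea as the for/while loops.
inline void swap_temp(std::uint32_t& x, std::uint32_t& y) {
    std::uint32_t temp = x;
    x = y;
    y = temp;
}

// Triple-XOR swap. x and y must be distinct objects: if both references
// name the same variable, the first XOR sets it to zero.
inline void swap_xor(std::uint32_t& x, std::uint32_t& y) {
    x ^= y;  // x = x0 ^ y0
    y ^= x;  // y = x0
    x ^= y;  // x = y0
}

// add/sub swap. Unsigned arithmetic wraps modulo 2^32, so the intermediate
// sum is well defined even when it does not fit. Same aliasing caveat.
inline void swap_add_sub(std::uint32_t& x, std::uint32_t& y) {
    x += y;     // x = x0 + y0
    y = x - y;  // y = x0
    x -= y;     // x = y0
}
```

Each of the two "clever" versions makes every step depend on the result of the previous one, which is one plausible reason they don't come out ahead of the plain temporary in the timings above. In ordinary C++ you would just call std::swap and let the compiler pick the instructions.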

Bottom line: at least for a task this trivial, you're not going to beat a decent compiler by enough to care about (and probably not at all, except possibly in terms of minutely smaller code).
