Reason why custom loop is faster? Bad compiler? Unsafe custom code? Luck (lucky cache hits)?

失恋的感觉 2021-01-16 15:08

I just started learning assembly and wrote some custom loops for swapping two variables, using C++'s asm{} block with the Digital Mars compiler in C-Free 5.0.

Enabled th

5 answers
  •  有刺的猬
    2021-01-16 15:39

    It's a bit hard to guess what your compiler may be doing without seeing the assembly language result it creates. With VC++ 10, I get the following results:

    time of for-loop(cycles)        155
    time of while-loop(cycles)      158
    time of custom-loop-1(cycles)   369
    time of custom-loop-2(cycles)   314
    

    I didn't look at the output, but my immediate guess would be that the difference between the for and while loops is just noise. Both are obviously quite a bit faster than your hand-written assembly code though.
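
    One way to check whether a gap like 155 vs. 158 is noise is to repeat each measurement and compare the best runs. The following is only a sketch under assumptions: it uses C++11's std::chrono::steady_clock rather than the clock() calls in the original program, and the helper name best_of_ms is hypothetical.

    ```cpp
    #include <algorithm>
    #include <chrono>
    #include <limits>

    // Hypothetical helper: runs `work` `runs` times and returns the fastest
    // wall-clock time in milliseconds. The minimum is less sensitive to
    // scheduler and cache noise than a single run.
    // With runs <= 0 the loop never executes and the sentinel maximum is returned.
    template <class F>
    long long best_of_ms(F work, int runs) {
        using steady = std::chrono::steady_clock;
        long long best = std::numeric_limits<long long>::max();
        for (int r = 0; r < runs; ++r) {
            steady::time_point start = steady::now();
            work();
            steady::time_point stop = steady::now();
            long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                               stop - start).count();
            best = std::min(best, ms);
        }
        return best;
    }
    ```

    If two variants differ by less than the spread between repeated runs of the same variant, the difference is probably not meaningful.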

    Edit: looking at the assembly code, I was right -- the code for the for and the while is identical. It looks like this:

            call    _clock
            mov     ecx, DWORD PTR _a$[ebp]
            cdq
            mov     ebx, edx
            mov     edx, DWORD PTR _b$[ebp]
            mov     edi, eax
            mov     esi, 200000000
    $LL2@main:
    ; Line 28
            dec     esi
    ; Line 30
            mov     eax, ecx
    ; Line 31
            mov     ecx, edx
    ; Line 32
            mov     edx, eax
            jne     SHORT $LL2@main
            mov     DWORD PTR _b$[ebp], edx
            mov     DWORD PTR _a$[ebp], ecx
    ; Line 35
            call    _clock
    

    While this is arguably less "clever" than your second loop, modern CPUs tend to do best with simple code. It also has fewer instructions inside the loop, and it doesn't reference memory inside the loop at all. Those aren't the sole measures of efficiency by any means, but with a loop this simple, they're fairly indicative.
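
    At the source level, the shape of that generated code (load once, swap in registers, store once) can be sketched as below. This is an illustration only: the function name swap_loop and the local copies x and y are hypothetical, and whether a given compiler actually keeps them in registers depends on its optimizer.

    ```cpp
    // Hypothetical helper: swaps a and b `iterations` times using a temporary.
    // The loop body mirrors the three register moves in the listing above.
    void swap_loop(int &a, int &b, int iterations) {
        int x = a;  // read memory once, before the loop
        int y = b;
        for (int i = 0; i < iterations; ++i) {
            int temp = x;  // no memory reference is required inside the loop
            x = y;
            y = temp;
        }
        a = x;  // write the results back once, after the loop
        b = y;
    }
    ```

    An odd iteration count leaves the two values exchanged; an even count leaves them where they started.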

    Edit 2:

    Just for fun, I wrote a new version that adds the triple-XOR swap, as well as one using the CPU's xchg instruction (just because that's how I'd probably write it by hand if I didn't care much about speed, etc.) Though Intel/AMD generally recommend against the more complex instructions, it doesn't seem to cause a problem -- it seems to be coming out at least as fast as anything else:

    time of for-loop(cycles)               156
    time of while-loop(cycles)             160
    time swap between register and cache   284
    time to swap using add/sub:            308
    time to swap using xchg:               155
    time to swap using triple-xor          233
    

    Source:

    // Note: updated source -- it was just too ugly to live. Same results though.
    #include <ctime>
    #include <iomanip>
    #include <iostream>
    #include <sstream>
    #include <string>
    
    namespace { 
        int a, b;
        const int loops = 200000000;
    }
    
    template <class swapper>
    struct timer {
        timer(std::string const &label) { 
            clock_t t1 = clock();
            swapper()();
            clock_t t2 = clock();
            std::ostringstream buffer;
            buffer << "Time for swap using " << label;
            std::cout << std::left << std::setw(30) << buffer.str() << " = " << (t2-t1) << "\n";
        }
    };
    
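    struct for_loop {
        void operator()() {
            int temp;
            for(int i=0;i<loops;i++) {
                temp = a;
                a = b;
                b = temp;
            }
        }
    };

    // ... (the while-loop, reg<->mem, add/sub, xchg, and triple-xor swappers are not preserved here)

    int main() {
        // Template arguments other than for_loop are elided along with their definitions.
        timer<for_loop>("for loop");
        timer<...>("while loop");
        timer<...>("reg<->mem");
        timer<...>("add/sub");
        timer<...>("xchg");
        timer<...>("triple xor");
        return 0;
    }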
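
    For readers who haven't seen them, the add/sub and triple-XOR swaps named in the timings can be written in portable C++ roughly as follows. This is a sketch, not the inline assembly that was measured: the function names are hypothetical, and unsigned operands are an assumption made here so that overflow in the add/sub version wraps in a well-defined way.

    ```cpp
    // Hypothetical helper: swap via addition and subtraction.
    // Unsigned arithmetic wraps modulo 2^N, so the intermediate sum may overflow safely.
    void swap_add_sub(unsigned &a, unsigned &b) {
        if (&a == &b) return;  // aliasing would zero the value
        a = a + b;
        b = a - b;  // b now holds the original a
        a = a - b;  // a now holds the original b
    }

    // Hypothetical helper: swap via three XOR operations.
    void swap_xor(unsigned &a, unsigned &b) {
        if (&a == &b) return;  // x ^ x == 0, so aliasing would zero the value
        a ^= b;
        b ^= a;  // b now holds the original a
        a ^= b;  // a now holds the original b
    }
    ```

    Both versions form a chain in which each step depends on the result of the previous one, which is one plausible reason they time slower than the plain temporary-variable swap in the results above.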
    

    Bottom line: at least for this trivial of a task, you're not going to beat a decent compiler by enough to care about (and probably not at all, except possibly in terms of minutely smaller code).
