reason why custom loop is faster? bad compiler? unsafe custom code? luck?(lucky cache hits)

后端 未结 5 1028
失恋的感觉
失恋的感觉 2021-01-16 15:08

i just started learning assembly and making some custom loop for swapping two variables using C++ \'s asm{} body with Digital-Mars compiler in C-Free 5.0

Enabled th

相关标签:
5条回答
  • 2021-01-16 15:16

    It's likely due the fact that the compiler fails to make it register-operands, working on indirect (address) operands instead.

    Switch compilers <-- this is your best optimization.

    Update I have gone through the trouble of translating the the same program gcc intel inline assembly: test.c. It clearly shows how the for-loop and and-while loop are vastly superior to the handwritten assembly.


    That said, with Digital Mars, the following is faster:

    __asm
    {
        xor ecx,j     //init of loop range(200000000 to 0)
    
        mov eax,a     //getting variables to registers
        mov ebx,b
    
    do_it_again3: //begin to loop
    
        //swapping with xor idiom
        xor eax,ebx
        xor ebx,eax         
        xor eax,ebx         
    
        mov a,eax
        mov b,ebx
    
        dec ecx           // j--
        jnz do_it_again3  // end of loop block
    }
    

    using

    • the XOR swap idiom
    • descending loop
    • implicit comparison flags (with dec ecx)

    My benchmark with Digital Mars Compiler Version 8.42n results in:

    time of for-loop(cycles) 572  
    time of while-loop(cycles)  566  
    time of custom-loop-1(cycles)   355   
    time of custom-loop-2(cycles)  317   
    time of custom-loop-3(cycles)  234   
    

    Full listing:

    #include<stdio.h>
    #include<stdlib.h>
    #include<time.h>
    
    int main()
    {
        int j=0;
    
        int a=0,b=0,temp=0;
    
        srand(time(0));
        time_t t1=0;
        time_t t2=0;
    
    
        t1=clock();
        for(int i=0; i<200000000; i++)
        {
            temp=a;//instruction 1
            a=b;//instruction 2
            b=temp;//3 instructions total
        }
        t2=clock();
        printf("\n time of for-loop(cycles) %i  \n",(t2-t1));
    
    
        t1=clock();
        while(j<200000000)
        {
            temp=a;//again it is three instructions
            a=b;
            b=temp;
            j++;
        }
        t2=clock();
        printf("\n time of while-loop(cycles)  %i  \n",(t2-t1));
    
    
        t1=clock();
        j=200000000;//setting the count
        __asm
        {
            pushf           //backup
            push eax        //backup
            push ebx        //backup
            push ecx        //backup
            push edx        //backup
    
            mov ecx,0       //init of loop range(0 to 200000000)
            mov edx,j
    
            do_it_again:    //begin to loop
    
    
            mov eax,a       //basic swap steps between cpu and mem(cache)
            mov ebx,b
            mov b,eax
            mov a,ebx       //four instructions total
    
            inc ecx         // j++
            cmp ecx,edx     //i<200000000  ?
            jb do_it_again  // end of loop block
    
            pop edx     //rolling back to history
            pop ecx
            pop ebx
            pop eax
            popf
        }
    
        t2=clock();
        printf("\n time of custom-loop-1(cycles)   %i   \n",(t2-t1));
    
        t1=clock();
        j=200000000;//setting the count
        __asm
        {
            pushf           //backup
                push eax        
                push ebx        
                push ecx        
                push edx        
    
                mov ecx,0       //init of loop range(0 to 200000000)
                mov edx,j
    
                mov eax,a       //getting variables to registers
                mov ebx,b
    
                do_it_again2:   //begin to loop
    
                //swapping with using only 2 variables(only in cpu)
                sub eax,ebx         //a is now a-b
                add ebx,eax         //b is now a
                sub eax,ebx         //a is now -b
                xor eax,80000000h   //a is now b and four instructions total
    
                inc ecx         // j++
                cmp ecx,edx     //i<200000000  ?
                jb do_it_again2  // end of loop block
    
                pop edx         //rollback
                pop ecx         
                pop ebx         
                pop eax         
                popf            
        }
    
        t2=clock();
        printf("\n time of custom-loop-2(cycles)  %i   \n",(t2-t1));
    
        t1=clock();
        j=200000000;//setting the count
        __asm
        {
            xor ecx,j     //init of loop range(200000000 to 0)
    
            mov eax,a     //getting variables to registers
            mov ebx,b
    
        do_it_again3:   //begin to loop
    
            //swapping with using only 2 variables(only in cpu)
            xor eax,ebx
            xor ebx,eax         
            xor eax,ebx         
    
            mov a,eax
            mov b,ebx
    
            dec ecx         // j--
            jnz do_it_again3  // end of loop block
        }
    
        t2=clock();
        printf("\n time of custom-loop-3(cycles)  %i   \n",(t2-t1));
    
        return 0;
    
    }
    
    0 讨论(0)
  • 2021-01-16 15:22

    I'm surprised that any of you guys got anything but zero cycles from the C code. Here, with gcc 4.6.3 and -O2, the loop vanishes away as there is no side-effect from it. Everything except the asm block is removed. I would be surprised if Digital Mars can't do such a trivial optimization; I bet you can try different optimization switches that will remove the C code, at which point such trivial comparison becomes impossible.

    Your toy example is useless to compare compiler optimizations with hand-crafted assembly. Statistically speaking, compilers can consistently write better machine code than humans.

    0 讨论(0)
  • 2021-01-16 15:30

    That's normal and changing the compiler wouldn't solve this "problem". Assembler is extremely low-level and you have control of everything. Your C++ compiler always does more than it needs. Calling a function would take more time than it would take in assembly, because the compiler protects the stack (for example). And in loop that's the same thing: Declare a new variable takes more time, add values also etc...

    This question should be interesting for some more information: When is assembler faster than C?

    0 讨论(0)
  • 2021-01-16 15:39

    It's a bit hard to guess what your compiler may be doing without seeing the assembly language result it creates. With VC++ 10, I get the following results:

    time of for-loop(cycles) 155
    
    time of while-loop(cycles)  158
    
    time of custom-loop-1(cycles)   369
    
    time of custom-loop-2(cycles)  314
    

    I didn't look at the output, but my immediate guess would be that the difference between the for and while loops is just noise. Both are obviously quite a bit faster than your hand-written assembly code though.

    Edit: looking at the assembly code, I was right -- the code for the for and the while is identical. It looks like this:

            call    _clock
            mov     ecx, DWORD PTR _a$[ebp]
            cdq
            mov     ebx, edx
            mov     edx, DWORD PTR _b$[ebp]
            mov     edi, eax
            mov     esi, 200000000
    $LL2@main:
    ; Line 28
            dec     esi
    ; Line 30
            mov     eax, ecx
    ; Line 31
            mov     ecx, edx
    ; Line 32
            mov     edx, eax
            jne     SHORT $LL2@main
            mov     DWORD PTR _b$[ebp], edx
            mov     DWORD PTR _a$[ebp], ecx
    ; Line 35
            call    _clock
    

    While arguably less "clever" than your second loop, modern CPUs tend to do best with simple code. It also just has fewer instructions inside the loop (and doesn't reference memory inside the loop at all). Those aren't the sole measures of efficiency by any means, but with this simple of a loop, they're fairly indicative.

    Edit 2:

    Just for fun, I wrote a new version that adds the triple-XOR swap, as well as one using the CPU's xchg instruction (just because that's how I'd probably write it by hand if I didn't care much about speed, etc.) Though Intel/AMD generally recommend against the more complex instructions, it doesn't seem to cause a problem -- it seems to be coming out at least as fast as anything else:

     time of for-loop(cycles) 156
    
     time of while-loop(cycles)  160
    
     time swap between register and cache  284
    
     time to swap using add/sub:  308
    
     time to swap using xchg:  155
    
     time to swap using triple-xor  233
    

    Source:

    // Note: updated source -- it was just too ugly to live. Same results though.
    #include<stdlib.h>
    #include<time.h>
    #include <iostream>
    #include <string>
    #include <iomanip>
    #include <sstream>
    
    namespace { 
        int a, b;
        const int loops = 200000000;
    }
    
    template <class swapper>
    struct timer {
        timer(std::string const &label) { 
            clock_t t1 = clock();
            swapper()();
            clock_t t2 = clock();
            std::ostringstream buffer;
            buffer << "Time for swap using " << label;
            std::cout << std::left << std::setw(30) << buffer.str() << " = " << (t2-t1) << "\n";
        }
    };
    
    struct for_loop {
        void operator()() {
            int temp;
            for(int i=0;i<loops;i++) {
                temp=a;//instruction 1
                a=b;//instruction 2
                b=temp;//3 instructions total   
            }
        }
    };
    
    struct while_loop {
        void operator()() { 
            int j = 0;
            int temp;
            while(j<loops) {
                temp=a;//again it is three instructions
                a=b;
                b=temp; 
                j++;
            }
        }
    };
    
    struct reg_mem {
        void operator()() {
            int j=loops;//setting the count
            __asm {
                mov ecx,0       //init of loop range(0 to 200000000)
                mov edx,j
        do_it_again:    //begin to loop
                mov eax,a       //basic swap steps between cpu and mem(cache)
                mov ebx,b       
                mov b,eax       
                mov a,ebx       //four instructions total
    
                inc ecx         // j++
                cmp ecx,edx     //i<200000000  ?
                jb do_it_again  // end of loop block
            }
        }
    };
    
    struct add_sub {
        void operator()() { 
            int j=loops;//setting the count
            __asm {
                mov ecx,0       //init of loop range(0 to 200000000)
                mov edx,j
    
                mov eax,a       //getting variables to registers
                mov ebx,b
    
        do_it_again2:   //begin to loop
    
                //swapping with using only 2 variables(only in cpu)
                sub eax,ebx         //a is now a-b
                add ebx,eax         //b is now a
                sub eax,ebx         //a is now -b
                xor eax,80000000h   //a is now b and four instructions total
    
                inc ecx         // j++
                cmp ecx,edx     //i<200000000  ?
                jb do_it_again2  // end of loop block
    
                mov a, eax
                mov b, ebx
            }
        }
    };
    
    struct xchg {
        void operator()() {
            __asm {
                mov ecx, loops
                mov eax, a
                mov ebx, b
        do_it_again3:
                dec ecx
                xchg eax, ebx
                jne do_it_again3
                mov a, eax
                mov b, ebx
            }
        }
    };
    
    struct xor3 {
        void operator()() { 
            _asm { 
                mov ecx, loops
                mov eax, a
                mov edx, b
        do_swap4:
                xor eax, edx
                xor edx, eax
                xor eax, edx
                dec ecx
                jnz do_swap4
    
                mov a, eax
                mov b, edx
            }
        }
    };
    
    int main() {
        timer<for_loop>("for loop");
        timer<while_loop>("while loop");
        timer<reg_mem>("reg<->mem");
        timer<add_sub>("add/sub");
        timer<xchg>("xchg");
        timer<xor3>("triple xor");
        return 0;
    }
    

    Bottom line: at least for this trivial of a task, you're not going to beat a decent compiler by enough to care about (and probably not at all, except possibly in terms of minutely smaller code).

    0 讨论(0)
  • 2021-01-16 15:39

    The code generated by that compiler is pretty horrible. After disassembling the object file with objconv, here's what I got in regards to the first for loop.

    ?_001:  cmp     dword [ebp-4H], 200000000               ; 0053 _ 81. 7D, FC, 0BEBC200
            jge     ?_002                                   ; 005A _ 7D, 17
            inc     dword [ebp-4H]                          ; 005C _ FF. 45, FC
            mov     eax, dword [ebp-18H]                    ; 005F _ 8B. 45, E8
            mov     dword [ebp-10H], eax                    ; 0062 _ 89. 45, F0
            mov     eax, dword [ebp-14H]                    ; 0065 _ 8B. 45, EC
            mov     dword [ebp-18H], eax                    ; 0068 _ 89. 45, E8
            mov     eax, dword [ebp-10H]                    ; 006B _ 8B. 45, F0
            mov     dword [ebp-14H], eax                    ; 006E _ 89. 45, EC
            jmp     ?_001                                   ; 0071 _ EB, E0
    

    The issues should be clear to anybody who ever looked at some assembly.

    1. The loop is very tightly dependent on the value that is put in eax. This makes any out-of-order execution practically impossible due to dependencies created on that register by every next instruction.

    2. There are six general-purpose registers available (since ebp and esp aren't really general-purpose in most of the setups), but your compiler uses none of them, falling back to using the local stack. This is absolutely unacceptable when speed is the optimization goal. We can even see that the current loop index is stored at [ebp-4H], while it could've been easily stored in a register.

    3. The cmp instruction uses a memory and an immediate operand. This is the slowest possible mix of operands and should never be used when performance is at stake.

    4. And don't get me started on the code size. Half of those instructions are just unnecessary.

    All in all, the first thing I'd do is ditch that compiler at the earliest possible chance. But then again, seeing that it offers "memory models" as one of its options, one can't really seem to have much hope.

    0 讨论(0)
提交回复
热议问题