How slow (how many cycles) is calculating a square root?


How slow (how many cycles) is calculating a square root? This came up in a molecular dynamics course where efficiency is important, and taking unnecessary square roots had a noticeable impact on the running time of the algorithms.

3 Answers
  • 2020-11-30 09:00

    Square root takes several cycles, but it takes orders of magnitude more to access memory if it is not in cache. Therefore, trying to avoid computations by fetching pre-computed results from memory may actually be detrimental to performance.

    It's difficult to say in the abstract whether you would gain anything, so if you want to know for sure, benchmark both approaches.

    Here's a great talk on the matter by Eric Brummer, a compiler developer on MSVC: http://channel9.msdn.com/Events/Build/2013/4-329
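
    As a crude way to "benchmark both approaches", the sketch below compares computing std::sqrt directly against fetching pre-computed values from a table that is deliberately larger than a typical last-level cache, using a pseudo-random index so the hardware prefetcher can't hide the memory latency. The table size and the index-scrambling constant are arbitrary choices for this example, not anything from the talk.

    #include <cmath>
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <vector>

    int main() {
        const unsigned N = 1u << 23;             // 8M entries, ~64 MB of doubles
        std::vector<double> table(N);
        for (unsigned i = 0; i < N; ++i)
            table[i] = std::sqrt((double)i);     // pre-computed results

        volatile double sink = 0.0;              // keeps the loops from being optimized away

        // 1) compute each square root directly
        std::clock_t t0 = std::clock();
        for (unsigned i = 0; i < N; ++i)
            sink = sink + std::sqrt((double)((i * 2654435761u) % N));
        double direct_ms = (std::clock() - t0) * 1000.0 / CLOCKS_PER_SEC;

        // 2) fetch the pre-computed result from memory instead
        t0 = std::clock();
        for (unsigned i = 0; i < N; ++i)
            sink = sink + table[(i * 2654435761u) % N];
        double lookup_ms = (std::clock() - t0) * 1000.0 / CLOCKS_PER_SEC;

        std::printf("direct sqrt: %.0f ms, table lookup: %.0f ms\n", direct_ms, lookup_ms);
        return 0;
    }

    Whether the lookup wins depends entirely on whether the table stays in cache; with a table this large, the scattered memory accesses will usually lose to simply recomputing the square root.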

  • 2020-11-30 09:06

    From Agner Fog's Instruction Tables:

    On Core2 65nm, FSQRT takes 9 to 69 clock cycles (with almost equal reciprocal throughput), depending on the value and the precision bits. For comparison, FDIV takes 9 to 38 cycles (again with almost equal reciprocal throughput), FMUL takes 5 (reciprocal throughput = 2) and FADD takes 3 (reciprocal throughput = 1). SSE performance is about equal, but it looks faster because it doesn't do 80-bit math. SSE does, however, have a very fast approximate reciprocal and approximate reciprocal square root.

    On Core2 45nm, division and square root got faster: FSQRT takes 6 to 20 cycles, FDIV takes 6 to 21 cycles, and FADD and FMUL haven't changed. Once again, SSE performance is about the same.

    You can get the documents with this information from his website.
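
    The approximate reciprocal square root mentioned above is exposed through the SSE RSQRTPS instruction (the _mm_rsqrt_ps intrinsic). As a rough illustrative sketch (the test values and the single refinement step are my own choices, not anything from the instruction tables), the usual pattern is one low-precision hardware estimate followed by a Newton-Raphson correction:

    #include <xmmintrin.h>   // SSE intrinsics: _mm_rsqrt_ps and friends
    #include <cmath>
    #include <cstdio>

    // Approximate 1/sqrt(x) for four floats at once: RSQRTPS gives roughly a
    // 12-bit estimate, and one Newton-Raphson step y' = y * (1.5 - 0.5*x*y*y)
    // brings it to roughly 22 bits.
    static __m128 fast_rsqrt(__m128 x) {
        __m128 y      = _mm_rsqrt_ps(x);
        __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
        __m128 y2     = _mm_mul_ps(y, y);
        return _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f), _mm_mul_ps(half_x, y2)));
    }

    int main() {
        alignas(16) float in[4] = { 2.0f, 3.0f, 10.0f, 100.0f };
        alignas(16) float out[4];
        _mm_store_ps(out, fast_rsqrt(_mm_load_ps(in)));
        for (int i = 0; i < 4; ++i)
            std::printf("rsqrt(%g) ~ %.7f (exact %.7f)\n",
                        in[i], out[i], 1.0 / std::sqrt((double)in[i]));
        return 0;
    }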

  • 2020-11-30 09:11

    Square root is about 4 times slower than addition with -O2, or about 13 times slower without -O2. Elsewhere on the net I found estimates of 50-100 cycles, which may be true, but an absolute cycle count isn't a very useful measure of relative cost, so I threw together the code below to make a relative measurement. Let me know if you see any problems with the test code.

    The code below was run on an Intel Core i3 under the Windows 7 operating system and was compiled with DevC++ (which uses GCC). Your mileage may vary.

    #include <cstdlib>
    #include <iostream>
    #include <cmath>
    #include <ctime>    // std::clock, std::clock_t, CLOCKS_PER_SEC
    
    /*
    Output using -O2:
    
    1 billion square roots running time: 14738ms
    
    1 billion additions running time   : 3719ms
    
    Press any key to continue . . .
    
    Output without -O2:
    
    10 million square roots running time: 870ms
    
    10 million additions running time   : 66ms
    
    Press any key to continue . . .
    
    Results:
    
    Square root is about 4 times slower than addition using -O2,
                or about 13 times slower without using -O2
    */
    
    int main(int argc, char *argv[]) {
    
        const int cycles = 100000;
        const int subcycles = 10000;
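        // cycles * subcycles = 100,000 * 10,000 = 1 billion operations per timed loop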
    
        double squares[cycles];
    
        for ( int i = 0; i < cycles; ++i ) {
            squares[i] = rand();
        }
    
        std::clock_t start = std::clock();
    
        for ( int i = 0; i < cycles; ++i ) {
            for ( int j = 0; j < subcycles; ++j ) {
                squares[i] = std::sqrt(squares[i]);
            }
        }
    
        double time_ms = ( ( std::clock() - start ) / (double) CLOCKS_PER_SEC ) * 1000;
    
        std::cout << "1 billion square roots running time: " << time_ms << "ms" << std::endl;
    
        start = std::clock();
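        // baseline: time the same number of plain additions for comparison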
    
        for ( int i = 0; i < cycles; ++i ) {
            for ( int j = 0; j < subcycles; ++j ) {
                squares[i] = squares[i] + squares[i];
            }
        }
    
        time_ms = ( ( std::clock() - start ) / (double) CLOCKS_PER_SEC ) * 1000;
    
        std::cout << "1 billion additions running time   : " << time_ms << "ms" << std::endl;
    
        system("PAUSE");
        return EXIT_SUCCESS;
    }
    