Unusual Speed Difference between Python and C++

庸人自扰 2020-12-22 21:25

I recently wrote a short algorithm to calculate happy numbers in python. The program allows you to pick an upper bound and it will determine all the happy numbers below it.
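
For readers who have not met the term: a number is happy if repeatedly replacing it with the sum of the squares of its digits eventually reaches 1. The sketch below is only an illustration of that definition, not the asker's original program (which is not reproduced on this page); the function names are made up for the example.

```python
def digit_square_sum(n):
    """Return the sum of the squares of the decimal digits of n."""
    total = 0
    while n:
        n, digit = divmod(n, 10)
        total += digit * digit
    return total


def is_happy(n):
    """Return True if repeatedly applying digit_square_sum reaches 1."""
    seen = set()
    while n != 1 and n not in seen:
        seen.add(n)  # remember the chain so a cycle can be detected
        n = digit_square_sum(n)
    return n == 1


def happy_numbers_below(upper_bound):
    """List every happy number in 1 .. upper_bound - 1."""
    return [n for n in range(1, upper_bound) if is_happy(n)]
```

For example, `happy_numbers_below(20)` should give `[1, 7, 10, 13, 19]`.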

17 Answers
  • 2020-12-22 21:36

    There's a new, radically faster version as a separate answer, so this answer is deprecated.


    I rewrote your algorithm by making it cache whenever it finds the number to be happy or unhappy. I also tried to make it as pythonic as I could, for example by creating separate functions digits() and happy(). Sorry for using Python 3, but I get to show off a couple of useful things from it as well.

    This version is much faster. It runs in 1.7s, which is about 10 times faster than your original program, which takes 18s (well, my MacBook is quite old and slow :) )

    #!/usr/bin/env python3
    
    from timeit import Timer
    from itertools import count
    
    print_numbers = False
    upperBound = 10**5  # Default value, can be overridden by user.
    
    
    def digits(x:'nonnegative number') -> "yields number's digits":
        if not (x >= 0): raise ValueError('Number should be nonnegative')
        while x:
            yield x % 10
            x //= 10
    
    
    def happy(number, known = {1}, happies = {1}) -> 'True/None':
        '''This function tells if the number is happy or not, caching results.
    
        It uses two static variables, parameters known and happies; the
        first one contains known happy and unhappy numbers; the second 
        contains only happy ones.
    
        If you want, you can pass your own known and happies arguments. If
        you do, they must satisfy the invariant in the commented-out assert
        below (1 is in known, and happies is a subset of known).
    
        '''
    
    #        assert 1 in known and happies <= known  # <= is expensive
    
        if number in known:
            return number in happies
    
        history = set()
        while True:
            history.add(number)
            number = sum(x**2 for x in digits(number))
            if number in known or number in history:
                break
    
        known.update(history)
        if number in happies:
            happies.update(history)
            return True
    
    
    def calcMain():
        happies = {x for x in range(upperBound) if happy(x) }
        if print_numbers:
            print(happies)
    
    
    if __name__ == '__main__':
        upperBound = eval(
                input("Pick an upper bound [default {0}]: "
                        .format(upperBound)).strip()
                or repr(upperBound))
        result = Timer(calcMain).timeit(1)
        print ('This computation took {0} seconds'.format(result))
    
  • There are quite a few optimizations possible:

    (1) Use const references

    bool inVector(int inQuestion, const vector<int>& known)
    {
        for(vector<int>::const_iterator it = known.begin(); it != known.end(); ++it)
            if(*it == inQuestion)
                return true;
        return false;
    }
    
    int sum(const vector<int>& given)
    {
        int sum = 0;
        for(vector<int>::const_iterator it = given.begin(); it != given.end(); ++it)
            sum += *it;
        return sum;
    }
    

    (2) Use counting down loops

    int pow(int given, int power)
    {
        int current = 1;
        while(power--)
            current *= given;
        return current;
    }
    

    Or, as others have said, use the standard library code.

    (3) Don't allocate buffers where not required

            vector<int> squares;
            for (int temp = current; temp != 0; temp /= 10)
            {
                squares.push_back(pow(temp % 10, 2));
            }
    
  • 2020-12-22 21:38

    Other optimizations: using arrays with direct access by loop index rather than searching in a vector, and caching prior sums. The following code (inspired by Dr Asik's answer, but probably not optimized at all) runs 2445 times faster than the original C++ code and about 400 times faster than the Python code.

    #include <iostream>
    #include <windows.h>
    #include <vector>
    
    void calcMain(int upperBound, std::vector<int>& known)
    {
        int tempDigitCounter = upperBound;
        int numDigits = 0;
        while (tempDigitCounter > 0)
        {
            numDigits++;
            tempDigitCounter /= 10;
        }
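        // One extra digit of headroom: for a one-digit bound the chain can
        // still reach sums above 81 (e.g. 9 -> 81 -> 65 -> ... -> 89 -> 145).
        int maxSlots = (numDigits + 1) * 9 * 9;
        // history[s] holds the last i whose chain visited sum s; -1 means never.
        // A vector avoids reading uninitialized memory and frees itself.
        std::vector<int> history(maxSlots + 1, -1);
    
        // cache[n] holds the digit-square sum of n, or 0 if not computed yet.
        std::vector<int> cache(upperBound + 1, 0);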
    
        int current, sum, temp;
        for(int i = 0; i <= upperBound; i++)
        {
            current = i;
            while(true)
            {
                sum = 0;
                temp = current;
    
                bool inRange = temp <= upperBound;
                if (inRange)
                {
                    int cached = cache[temp];
                    if (cached)
                    {
                        sum = cached;
                    }
                }
    
                if (sum == 0)
                {
                    while (temp > 0)
                    {
                        int tempMod = temp % 10;
                        sum += tempMod * tempMod;
                        temp /= 10;
                    }
                    if (inRange)
                    {
                        cache[current] = sum;
                    }
                }
                current = sum;
                if(history[current] == i)
                {
                    if(current == 1)
                    {
                        known.push_back(i);
                    }
                    break;
                }
                history[current] = i;
            }
        }
    }
    
    int main()
    {
        while(true)
        {
            int upperBound;
            std::vector<int> known;
            std::cout << "Pick an upper bound: ";
            std::cin >> upperBound;
            long start, end;
            start = GetTickCount();
            calcMain(upperBound, known);
            end = GetTickCount();
            for (size_t i = 0; i < known.size(); ++i) {
                std::cout << known[i] << ", ";
            }               
            double seconds = (double)(end-start) / 1000.0;
            std::cout << std::endl << seconds << " seconds." << std::endl << std::endl;
        }
        return 0;
    }
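
The "direct access using the loop index" idea above can be sketched in Python as follows. This is a simplified illustration, not a port of the C++ code: it leaves out the cache of prior sums, and the function and variable names are assumptions made for the example. The table is stamped with the current candidate, so it never has to be cleared between candidates.

```python
def happy_numbers_up_to(upper_bound):
    """Return the happy numbers in 1 .. upper_bound using a stamped table."""
    # Upper limit for any digit-square sum in a chain. One extra digit of
    # headroom keeps small bounds in range (9 -> 81 -> ... -> 89 -> 145).
    max_sum = 81 * (len(str(upper_bound)) + 1)
    # stamp[s] is the last candidate whose chain visited sum s.
    # 0 means "never visited", which is safe because candidates start at 1.
    stamp = [0] * (max_sum + 1)
    happy = []
    for candidate in range(1, upper_bound + 1):
        current = candidate
        while True:
            total = 0
            while current:
                current, digit = divmod(current, 10)
                total += digit * digit
            current = total
            if stamp[current] == candidate:
                # This sum was already seen for this candidate: a cycle.
                if current == 1:
                    happy.append(candidate)
                break
            stamp[current] = candidate
    return happy
```

Checking `stamp[current] == candidate` is a single indexed read, whereas searching a history vector costs time proportional to the length of the chain.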
    
  • 2020-12-22 21:38

    Why is everyone using a vector in the C++ version? Lookup time is O(N).

    Even though it's not as efficient as the Python set, use std::set. Lookup time is O(log(N)).
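
The same contrast can be seen from the Python side, where a list plays the role of the vector and a set is hash-based. The snippet below is only a rough sketch for measuring the difference on your own machine; the sizes and repeat counts are arbitrary choices, and no particular timings are claimed. (In C++, std::unordered_set is the hash-based container closest to Python's set, while std::set is a sorted tree.)

```python
from timeit import timeit

values = list(range(100_000))
as_list = values        # "in" scans element by element: O(N)
as_set = set(values)    # "in" is a hash lookup: O(1) on average

missing = -1  # worst case for the list: every element gets compared

list_time = timeit(lambda: missing in as_list, number=200)
set_time = timeit(lambda: missing in as_set, number=200)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```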

  • 2020-12-22 21:39

    For 100000 elements, the Python code took 6.9 seconds while the C++ originally took above 37 seconds.

    I did some basic optimizations on your code and managed to get the C++ code more than 100 times faster than the Python implementation. It now does 100000 elements in 0.06 seconds. That is 617 times faster than the original C++ code.

    The most important thing is to compile in Release mode, with all optimizations. This code is literally orders of magnitude slower in Debug mode.

    Next, I will explain the optimizations I did.

    • Moved all vector declarations outside of the loop; replaced them by a clear() operation, which is much faster than calling the constructor.
    • Replaced the call to pow(value, 2) by a multiplication : value * value.
    • Instead of having a squares vector and calling sum on it, I sum the values in-place using just an integer.
    • Avoided all string operations, which are very slow compared to integer operations. For instance, it is possible to compute the squares of each digit by repeatedly dividing by 10 and fetching the modulus 10 of the resulting value, instead of converting the value to a string and then each character back to int.
    • Avoided all vector copies, first by replacing passing by value with passing by reference, and finally by eliminating the helper functions completely.
    • Eliminated a few temporary variables.
    • And probably many small details I forgot. Compare your code and mine side-by-side to see exactly what I did.
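
To illustrate the point in the list about avoiding string operations, here are the two styles of digit extraction side by side, written in Python for brevity. The function names are made up for this sketch. The speed claim above is about the C++ code; the relative cost in Python may be different, so measure before relying on it.

```python
def digit_square_sum_str(n):
    """Digit-square sum via string conversion (the style being replaced)."""
    return sum(int(ch) ** 2 for ch in str(n))


def digit_square_sum_int(n):
    """Digit-square sum using only integer division and modulus."""
    total = 0
    while n > 0:
        digit = n % 10      # last decimal digit
        total += digit * digit
        n //= 10            # drop the last digit
    return total
```

Both functions return the same value for any non-negative integer, for example 82 for 19.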

    It may be possible to optimize the code even more by using pre-allocated arrays instead of vectors, but this would be a bit more work and I'll leave it as an exercise to the reader. :P

    Here's the optimized code:

    #include <iostream>
    #include <vector>
    #include <string>
    #include <ctime>
    #include <algorithm>
    #include <windows.h>
    
    using namespace std;
    
    void calcMain(int upperBound, vector<int>& known);
    
    int main()
    {
        while(true)
        {
            vector<int> results;
            int upperBound;
            cout << "Pick an upper bound: ";
            cin >> upperBound;
            long start, end;
            start = GetTickCount();
            calcMain(upperBound, results);
            end = GetTickCount();
            for (size_t i = 0; i < results.size(); ++i) {
                cout << results[i] << ", ";
            }
            cout << endl;
            double seconds = (double)(end-start) / 1000.0;
            cout << seconds << " seconds." << endl << endl;
        }
        return 0;
    }
    
    void calcMain(int upperBound, vector<int>& known)
    {
        vector<int> history;
        for(int i = 0; i <= upperBound; i++)
        {
            int current = i;
            history.clear();
            while(true)
            {
                    int temp = current;
                    int sum = 0;
                    while (temp > 0) {
                        sum += (temp % 10) * (temp % 10);
                        temp /= 10;
                    }
                    current = sum;
                    if(find(history.begin(), history.end(), current) != history.end())
                    {
                            if(current == 1)
                            {
                                    known.push_back(i);
                            }
                            break;
                    }
                    history.push_back(current);
            }
        }
    }
    
  • 2020-12-22 21:46

    Stumbled over this page whilst bored and thought I'd golf it in js. The algorithm is my own, and I haven't checked it thoroughly against anything other than my own calculations (so it could be wrong). It calculates the first 1e7 happy numbers and stores them in h. If you want to change it, change both the 7s.

    m=1e7,C=7*81,h=[1],t=true,U=[,,,,t],n=w=2;
    while(n<m){
    z=w,s=0;while(z)y=z%10,s+=y*y,z=0|z/10;w=s;
    if(U[w]){if(n<C)U[n]=t;w=++n;}else if(w<n)h.push(n),w=++n;}
    

    This will print the first 1000 items for you in console or a browser:

    o=h.slice(0,m>1e3?1e3:m);
    (!this.document?print(o):document.load=document.write(o.join('\n')));
    

    155 characters for the functional part and it appears to be as fast* as Dr. Asik's offering on firefox or v8 (350-400 times as fast as the original python program on my system when running time d8 happygolf.js or js -a -j -p happygolf.js in spidermonkey).
    I shall be in awe of the analytic skills of anyone who can figure out why this algorithm is doing so well without referencing the longer, commented, fortran version.

    I was intrigued by how fast it was, so I learned fortran to get a comparison of the same algorithm. Be kind if there are any glaring newbie mistakes; it's my first fortran program (http://pastebin.com/q9WFaP5C). It uses static memory, so to be fair to the others it's in a self-compiling shell script. If you don't have gcc/bash/etc., strip out the preprocessor and bash stuff at the top, set the macros manually, and compile it as fortran95.

    Even if you include compilation time, it beats most of the others here. If you don't, it's about 3000-3500 times as fast as the original python version (and by extension >40,000 times as fast as the C++*, although I didn't run any of the C++ programs).

    Surprisingly, many of the optimizations I tried in the fortran version (incl. some like loop unrolling, which I left out of the pasted version due to small effect and readability) were detrimental to the js version. This exercise shows that modern trace compilers are extremely good (within a factor of 7-10 of carefully optimized, static memory fortran) if you get out of their way and don't try any tricky stuff.

    Finally, here's a much nicer, more recursive js version.

    // Adds the square of the last digit of x
    // to s, then integer divides x by 10.
    // Repeats until x is 0.
    function sumsq(x) {
      var y,s=0;
      while(x) {
        y = x % 10; 
        s += y * y;
        x = 0| x / 10; 
      }
      return s;
    }
    // A boolean cache for happy().
    // The terminating happy number and an unhappy number in
    // the terminating sequence.
    var H=[];
    H[1] = true;
    H[4] = false;
    // Test if a number is happy.
    // First check the cache, if that's empty
    // Perform one round of sumsq, then check the cache
    // For that. If that's empty, recurse.
    function happy(x) {
      // If it already exists.
      if(H[x] !== undefined) {
        // Return whatever is already in cache.
        return H[x];
      } else {
        // Else calc sumsq, set and  return cached val, or if undefined, recurse.
        var w = sumsq(x);
        return (H[x] = H[w] !== undefined? H[w]: happy(w));
      }
    }
    //Main program loop.
    var i, hN = []; 
    for(i = 1; i < 1e7; i++) {
      if(happy(i)) { hN.push(i); }
    }
    

    Surprisingly, even though it is rather high level, it did almost exactly as well as the imperative algorithm in spidermonkey (with optimizations on), and close (1.2 times as long) in v8.
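
For comparison with the Python answers in this thread, the same memoized recursion can be sketched in Python. This is a loose translation of the js above, not something that was timed for this answer. A dict stands in for the sparse array H, and it is seeded the same way: 1 is the happy fixed point and 4 is a member of the cycle that every unhappy number falls into.

```python
def sumsq(x):
    """Sum of the squares of the decimal digits of x."""
    total = 0
    while x:
        x, digit = divmod(x, 10)
        total += digit * digit
    return total


# Cache seeded with the happy fixed point and one number from the unhappy cycle.
cache = {1: True, 4: False}


def happy(x):
    """Return True if x (a positive integer) is happy, caching every result."""
    if x in cache:
        return cache[x]
    # One round of sumsq, then recurse; the chain always ends at 1 or 4.
    result = happy(sumsq(x))
    cache[x] = result
    return result


happy_numbers = [n for n in range(1, 1000) if happy(n)]
```

The recursion stays shallow because one round of sumsq already brings any number of this size down to a few hundred at most. Note that 0 must not be passed in, since its chain never reaches 1 or 4.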

    Moral of the story, I guess: spend a bit of time thinking about your algorithm if it's important. Also, high level languages already have a lot of overhead (and sometimes have tricks of their own to reduce it), so sometimes doing something more straightforward or utilizing their high level features is just as fast. Also, micro-optimization doesn't always help.

    *Unless my python installation is unusually slow, direct times are somewhat meaningless as this is a first generation eee. Times are:
    12s for fortran version, no output, 1e8 happy numbers.
    40s for fortran version, pipe output through gzip to disk.
    8-12s for both js versions, 1e7 happy numbers, no output, with full optimization.
    10-100s for both js versions, 1e7 happy numbers, no output, with less/no optimization (depending on the definition of no optimization; the 100s was with eval()).

    I'd be interested to see times for these programs on a real computer.
