Why is my Python NumPy code faster than C++?

隐瞒了意图╮ 2021-01-01 03:54

Why is this Python NumPy code,

import numpy as np
import time

k_max = 40000
N = 10000

data = np.zeros((2,N))
coefs = np.zeros((k_max,2),dtype=float)

t1 = t         


        
3 Answers
  • 2021-01-01 04:39

    I found this question interesting, because every time I have come across a similar topic about the speed of NumPy (compared to C/C++), the answers were always something like "it's a thin wrapper, its core is written in C, so it's fast". But that doesn't explain why C should be slower than C with an additional layer (even a thin one).

    The answer is: your C++ code is not slower than your Python code when properly compiled.

    I ran some benchmarks, and at first it seemed that NumPy was surprisingly faster. But I had forgotten to enable optimization when compiling with GCC.

    I recomputed everything and also compared the results with a pure C version of your code. I am using GCC 4.9.2 and Python 2.7.9 (compiled from source with the same GCC). I compiled your C++ code with g++ -O3 main.cpp -o main and my C code with gcc -O3 main.c -lm -o main. In all examples I filled the data arrays with some numbers (0.1, 0.4), as this changes the results. I also made the NumPy arrays use doubles (dtype=np.float64), because the C++ example uses doubles. My pure C version of your code (it's similar):

    #include <math.h>
    #include <stdio.h>
    #include <time.h>
    
    const int k_max = 100000;
    const int N = 10000;
    
    int main(void)
    {
        clock_t t_start, t_end;
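        double data1[N], data2[N], coefs1[k_max], coefs2[k_max], seconds;
        int z;
        for( z = 0; z < N; z++ )
        {
            data1[z] = 0.1;
            data2[z] = 0.4;
        }
        /* The accumulators must start at zero; they are summed into with += below. */
        for( z = 0; z < k_max; z++ )
        {
            coefs1[z] = 0.0;
            coefs2[z] = 0.0;
        }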
    
        int i, j;
        t_start = clock();
        for( i = 0; i < k_max; i++ )
        {
            for( j = 0; j < N-1; j++ )
            {
                coefs1[i] += data2[j] * (cos((i+1) * data1[j]) - cos((i+1) * data1[j+1]));
                coefs2[i] += data2[j] * (sin((i+1) * data1[j]) - sin((i+1) * data1[j+1]));
            }
        }
        t_end = clock();
    
        seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
        printf("Time: %f s\n", seconds);
        /* Print a result so the compiler cannot discard the loops. */
        printf("coefs1[0] = %f\n", coefs1[0]);
        return 0;
    }
    

    For k_max = 100000, N = 10000 the results were as follows:

    • Python 70.284362 s
    • C++ 69.133199 s
    • C 61.638186 s

    Python and C++ take basically the same time here. Note, however, that the Python version has an interpreted loop of length k_max, which should be much slower than the equivalent C/C++ loop. And it is.
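    The question's NumPy listing is cut off above, so the following is only a rough sketch: assuming its loop body mirrors the C code, the Python version runs one interpreted iteration per k and leaves the sum over j to NumPy. The function names coefs_numpy and coefs_loops and the tiny test sizes are made up for illustration.

```python
import math
import numpy as np

def coefs_numpy(data, k_max):
    """One Python-level iteration per k; the work over j happens inside NumPy."""
    x, w = data[0], data[1]
    coefs = np.zeros((k_max, 2), dtype=np.float64)
    for k in range(1, k_max + 1):
        c = np.cos(k * x)
        s = np.sin(k * x)
        # c[:-1] - c[1:] is cos(k*x[j]) - cos(k*x[j+1]) for j = 0 .. N-2
        coefs[k - 1, 0] = np.sum(w[:-1] * (c[:-1] - c[1:]))
        coefs[k - 1, 1] = np.sum(w[:-1] * (s[:-1] - s[1:]))
    return coefs

def coefs_loops(data, k_max):
    """The same arithmetic with explicit loops, mirroring the C version."""
    x, w = data[0], data[1]
    n = len(x)
    coefs = [[0.0, 0.0] for _ in range(k_max)]
    for i in range(k_max):
        for j in range(n - 1):
            coefs[i][0] += w[j] * (math.cos((i + 1) * x[j]) - math.cos((i + 1) * x[j + 1]))
            coefs[i][1] += w[j] * (math.sin((i + 1) * x[j]) - math.sin((i + 1) * x[j + 1]))
    return coefs

# Small illustrative input (not the 0.1 / 0.4 fill used for the timings).
data = np.array([np.linspace(0.0, 1.0, 50), np.full(50, 0.4)])
fast = coefs_numpy(data, 5)
slow = coefs_loops(data, 5)
print(np.allclose(fast, slow))
```

    In this form the interpreter only executes k_max iterations, each of which hands N elements to compiled code, so the interpreter's share of the run time shrinks as N grows.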

    For k_max = 1000000, N = 1000 we have:

    • Python 115.42766 s
    • C++ 70.781380 s

    For k_max = 1000000, N = 100:

    • Python 52.86826 s
    • C++ 7.050597 s

    So the gap grows with the ratio k_max/N, but Python is not faster even when N is much bigger than k_max, e.g. k_max = 100, N = 100000:

    • Python 0.651587 s
    • C++ 0.568518 s
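    One way to read these numbers is to divide each run time by k_max, which gives the average cost of a single outer-loop iteration. The sketch below does only that arithmetic on the timings quoted above; the helper name per_iteration_us is made up for illustration.

```python
# (k_max, N, python_seconds, cpp_seconds), copied from the timings above
runs = [
    (100000, 10000, 70.284362, 69.133199),
    (1000000, 1000, 115.42766, 70.781380),
    (1000000, 100, 52.86826, 7.050597),
    (100, 100000, 0.651587, 0.568518),
]

def per_iteration_us(seconds, k_max):
    """Average time of one outer-loop iteration, in microseconds."""
    return seconds / k_max * 1e6

gaps = {}
for k_max, n, py_s, cpp_s in runs:
    py_us = per_iteration_us(py_s, k_max)
    cpp_us = per_iteration_us(cpp_s, k_max)
    gaps[(k_max, n)] = py_us - cpp_us
    print("k_max=%d N=%d: Python %.1f us, C++ %.1f us, gap %.1f us, ratio %.2f"
          % (k_max, n, py_us, cpp_us, py_us - cpp_us, py_s / cpp_s))
```

    For the two runs with k_max = 1000000, the gap comes out at roughly 45 microseconds per iteration, whether the C++ side needs about 7 or about 71 microseconds. That is why the ratio is about 7.5 at N = 100 but only about 1.6 at N = 1000. The other two runs do not fit a single constant as neatly, so treat this as a rough reading rather than a model.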

    Obviously, the main speed difference between C/C++ and Python is in the for loop. But I also wanted to measure the difference between simple array operations in NumPy and in C. The advantages of using NumPy in your code are that it can (1) multiply a whole array by a number, (2) calculate sin/cos of a whole array, and (3) sum all elements of an array, instead of performing those operations on every item separately. So I prepared two scripts to compare only these operations.

    Python script:

    import numpy as np
    from time import time
    
    N = 10000
    x_len = 100000
    
    def main():
        x = np.ones(x_len, dtype=np.float64) * 1.2345
    
        start = time()
        for i in xrange(N):  # Python 2, as used for these timings; use range() on Python 3
            y1 = np.cos(x, dtype=np.float64)
        end = time()
        print('cos: {} s'.format(end-start))
    
        start = time()
        for i in xrange(N):
            y2 = x * 7.9463
        end = time()
        print('multi: {} s'.format(end-start))
    
        start = time()
        for i in xrange(N):
            res = np.sum(x, dtype=np.float64)
        end = time()
        print('sum: {} s'.format(end-start))
    
        return y1, y2, res
    
    if __name__ == '__main__':
        main()
    
    

    C script:

    #include <math.h>
    #include <stdio.h>
    #include <time.h>
    
    const int N = 10000;
    const int x_len = 100000;
    
    int main()
    {
        clock_t t_start, t_end;
        double x[x_len], y1[x_len], y2[x_len], res, time;
        int i, j;
        for( i = 0; i < x_len; i++ )
        {
            x[i] = 1.2345;
        }
    
        t_start = clock();
        for( j = 0; j < N; j++ )
        {
            for( i = 0; i < x_len; i++ )
            {
                y1[i] = cos(x[i]);
            }
        }
        t_end = clock();
        time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
        printf("cos: %f s\n", time);
    
        t_start = clock();
        for( j = 0; j < N; j++ )
        {
            for( i = 0; i < x_len; i++ )
            {
                y2[i] = x[i] * 7.9463;
            }
        }
        t_end = clock();
        time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
        printf("multi: %f s\n", time);
    
        t_start = clock();
        for( j = 0; j < N; j++ )
        {
            res = 0.0;
            for( i = 0; i < x_len; i++ )
            {
                res += x[i];
            }
        }
        t_end = clock();
        time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
        printf("sum: %f s\n", time);
    
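        /* Print the results so the compiler cannot discard the loops.
           (The comma expression originally returned here yields only res.) */
        printf("%f %f %f\n", y1[0], y2[0], res);
        return 0;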
    }
    
    

    Python results:

    • cos: 22.7199969292 s
    • multi: 0.841291189194 s
    • sum: 1.15971088409 s

    C results:

    • cos: 20.910590 s
    • multi: 0.633281 s
    • sum: 1.153001 s

    As you can see, NumPy is incredibly fast, but in each of these tests it is still a bit slower than pure C.
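    To put "a bit slower" into numbers, the sketch below divides each timing by the total number of array elements processed (N calls times x_len elements) and takes the NumPy/C ratio. It only reuses the figures listed above; the variable names are made up for illustration.

```python
calls = 10000      # N in both scripts
x_len = 100000     # array length in both scripts
elements = calls * x_len

# seconds, copied from the results listed above
numpy_s = {"cos": 22.7199969292, "multi": 0.841291189194, "sum": 1.15971088409}
c_s = {"cos": 20.910590, "multi": 0.633281, "sum": 1.153001}

ratios = {}
for op in ("cos", "multi", "sum"):
    numpy_ns = numpy_s[op] / elements * 1e9   # nanoseconds per element
    c_ns = c_s[op] / elements * 1e9
    ratios[op] = numpy_s[op] / c_s[op]
    print("%-5s NumPy %.2f ns/elem, C %.2f ns/elem, NumPy/C = %.2f"
          % (op, numpy_ns, c_ns, ratios[op]))
```

    On these figures NumPy is about 9% slower than C for cos, about 33% slower for the multiplication, and under 1% slower for the sum.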

  • 2021-01-01 04:41

    I tried to understand your Python code and reproduce it in C++. I found that your for-loops were not arranged correctly to compute the coefficients, so you should switch them. In that case, you should have the following:

    #include <iostream>
    #include <cmath>
    #include <time.h>
    
    const int k_max = 40000;
    const int N = 10000;
    
    double cos_k, sin_k;
    
    int main(int argc, char const *argv[])
    {
        time_t start, stop;
        // Zero-initialize: coefs is accumulated into with += below.
        double data[2][N] = {};
        double coefs[k_max][2] = {};
    
        time(&start);
    
        for(int i=0; i<k_max; ++i)
        {
            for(int j=1; j<N; ++j)  // start at 1 so that data[..][j-1] stays in bounds
            {
                coefs[i][0] += data[1][j-1] * (cos((i+1) * data[0][j-1]) - cos((i+1) * data[0][j]));
                coefs[i][1] += data[1][j-1] * (sin((i+1) * data[0][j-1]) - sin((i+1) * data[0][j]));
            }
        }
        // End of main loop
    
        time(&stop);
    
        // Speed result
        double diff = difftime(stop, start);
        std::cout << "Time: " << diff << " seconds" << std::endl;
    
        return 0;
    }
    

    With the loops switched, the C++ code compiled with -O3 runs in 3 seconds, while the Python code takes 7.816 seconds.

  • 2021-01-01 04:44

    On my computer, your (current) Python code runs in 14.82 seconds (yes, my computer's quite slow).

    I rewrote your C++ code to something I'd consider halfway reasonable (basically, I almost ignored your C++ code and just rewrote your Python into C++). That gave me this:

    #include <cstdio>
    #include <iostream>
    #include <cmath>
    #include <chrono>
    #include <vector>
    #include <assert.h>
    
    const unsigned int k_max = 40000;
    const unsigned int N = 10000;
    
    template <class T>
    class matrix2 {
        // Declared in the same order as the constructor's initializer list.
        size_t cols;
        size_t rows;
        std::vector<T> data;
    public:
        matrix2(size_t y, size_t x) : cols(x), rows(y), data(x*y) {}
        T &operator()(size_t y, size_t x) {
            assert(x < cols);  // valid indices are 0 .. cols-1
            assert(y < rows);  // valid indices are 0 .. rows-1
            return data[y*cols + x];
        }
    
        T operator()(size_t y, size_t x) const {
            assert(x < cols);
            assert(y < rows);
            return data[y*cols + x];
        }
    };
    
    int main() {
        matrix2<double> data(N, 2);
        matrix2<double> coeffs(k_max, 2);
    
        using namespace std::chrono;
    
        auto start = high_resolution_clock::now();
    
        for (int k = 0; k < k_max; k++) {
            for (int j = 0; j < N - 1; j++) {
                coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
                coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
            }
        }
    
        auto end = high_resolution_clock::now();
        std::cout << duration_cast<milliseconds>(end - start).count() << " ms\n";
    }
    

    This ran in about 14.4 seconds, so it's a slight improvement over the Python version--but given that the Python is mostly a pretty thin wrapper around some C code, getting only a slight improvement is pretty much what we should expect.

    The next obvious step would be to use multiple cores. To do that in C++, we can add this line:

    #pragma omp parallel for
    

    ...before the outer for loop:

    #pragma omp parallel for
    for (int k = 0; k < k_max; k++) {
        for (int j = 0; j < N - 1; j++) {
            coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
            coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
        }
    }
    

    With OpenMP enabled on the compiler's command line (-fopenmp for GCC and Clang, /openmp for MSVC), this ran in about 4.8 seconds. If you have more than 4 cores, you can probably expect a larger improvement than that (conversely, with fewer than 4 cores, expect a smaller one, but nowadays more than 4 is a lot more common than fewer).
