Why is this Python NumPy code faster than my C++ code?
import numpy as np
import time
k_max = 40000
N = 10000
data = np.zeros((2,N))
coefs = np.zeros((k_max,2),dtype=float)
t1 = time.time()
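The snippet breaks off here; judging from the C and C++ translations in the answers below, the timed loop was presumably something like the following (a reconstruction under that assumption, not the asker's exact code):

for i in xrange(k_max):
    # whole-array multiply, cos/sin and sum over j -- the three NumPy
    # operations the first answer benchmarks separately further down
    coefs[i, 0] = np.sum(data[1, :-1] * (np.cos((i+1) * data[0, :-1]) - np.cos((i+1) * data[0, 1:])))
    coefs[i, 1] = np.sum(data[1, :-1] * (np.sin((i+1) * data[0, :-1]) - np.sin((i+1) * data[0, 1:])))
t2 = time.time()
print('Time: {} s'.format(t2 - t1))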
I found this question interesting, because every time I encountered a similar topic about the speed of NumPy (compared to C/C++) there were always answers like "it's a thin wrapper, its core is written in C, so it's fast", but this doesn't explain why C should be slower than C with an additional layer (even a thin one).
The answer is: your C++ code is not slower than your Python code when properly compiled.
I've done some benchmarks, and at first it seemed that NumPy was surprisingly faster. But I had forgotten to enable optimizations when compiling with GCC.
I've computed everything again and also compared the results with a pure C version of your code. I am using GCC version 4.9.2 and Python 2.7.9 (compiled from source with the same GCC). To compile your C++ code I used g++ -O3 main.cpp -o main, and to compile my C code I used gcc -O3 main.c -lm -o main. In all examples I filled the data variables with some numbers (0.1, 0.4), as that changes the results. I also changed the np.arrays to use doubles (dtype=np.float64), because the C++ example uses doubles. My pure C version of your code (it's similar):
#include <math.h>
#include <stdio.h>
#include <time.h>

const int k_max = 100000;
const int N = 10000;

int main(void)
{
    clock_t t_start, t_end;
    /* zero-initialize the coefficient arrays, since the loop accumulates with += */
    double data1[N], data2[N], coefs1[k_max] = {0.0}, coefs2[k_max] = {0.0}, seconds;
    int z;
    for( z = 0; z < N; z++ )
    {
        data1[z] = 0.1;
        data2[z] = 0.4;
    }

    int i, j;
    t_start = clock();
    for( i = 0; i < k_max; i++ )
    {
        for( j = 0; j < N-1; j++ )
        {
            coefs1[i] += data2[j] * (cos((i+1) * data1[j]) - cos((i+1) * data1[j+1]));
            coefs2[i] += data2[j] * (sin((i+1) * data1[j]) - sin((i+1) * data1[j+1]));
        }
    }
    t_end = clock();
    seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("Time: %f s\n", seconds);
    /* return a result so the optimizer cannot discard the loop entirely */
    return coefs1[0];
}
For k_max = 100000, N = 10000, Python and C++ had basically the same time, but note that there is a Python loop of length k_max, which should be much slower than the C/C++ one. And it is.
For k_max = 1000000, N = 1000 and for k_max = 1000000, N = 100 the gap grows: the difference increases with the ratio k_max/N. But Python is not faster even for N much bigger than k_max, e.g. for k_max = 100, N = 100000.
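This pattern makes sense if each NumPy call carries a roughly constant per-call overhead that is only amortized over long arrays. A quick way to see it (my own illustrative snippet, not part of the original benchmarks; the exact numbers will differ per machine):

import numpy as np
from time import time

# time 100000 calls of np.sum on a tiny and on a large array
for n in (10, 100000):
    x = np.ones(n, dtype=np.float64)
    start = time()
    for i in xrange(100000):
        np.sum(x)
    # for the tiny array the total is dominated by per-call overhead,
    # not by the actual summation work
    print('n = {}: {} s'.format(n, time() - start))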
Obviously, the main speed difference between C/C++ and Python is in the for loop. But I wanted to find out the difference between simple operations on arrays in NumPy and in C. The advantages of using NumPy in your code consist of: 1. multiplying the whole array by a number, 2. calculating sin/cos of the whole array, 3. summing all elements of the array, instead of doing those operations on every single item separately. So I prepared two scripts to compare only these operations.
Python script:
import numpy as np
from time import time

N = 10000
x_len = 100000

def main():
    x = np.ones(x_len, dtype=np.float64) * 1.2345

    start = time()
    for i in xrange(N):
        y1 = np.cos(x, dtype=np.float64)
    end = time()
    print('cos: {} s'.format(end-start))

    start = time()
    for i in xrange(N):
        y2 = x * 7.9463
    end = time()
    print('multi: {} s'.format(end-start))

    start = time()
    for i in xrange(N):
        res = np.sum(x, dtype=np.float64)
    end = time()
    print('sum: {} s'.format(end-start))

    return y1, y2, res

if __name__ == '__main__':
    main()
# results
# cos: 22.7199969292 s
# multi: 0.841291189194 s
# sum: 1.15971088409 s
C script:
#include <math.h>
#include <stdio.h>
#include <time.h>

const int N = 10000;
const int x_len = 100000;

int main()
{
    clock_t t_start, t_end;
    double x[x_len], y1[x_len], y2[x_len], res, seconds;
    int i, j;
    for( i = 0; i < x_len; i++ )
    {
        x[i] = 1.2345;
    }

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        for( i = 0; i < x_len; i++ )
        {
            y1[i] = cos(x[i]);
        }
    }
    t_end = clock();
    seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("cos: %f s\n", seconds);

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        for( i = 0; i < x_len; i++ )
        {
            y2[i] = x[i] * 7.9463;
        }
    }
    t_end = clock();
    seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("multi: %f s\n", seconds);

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        res = 0.0;
        for( i = 0; i < x_len; i++ )
        {
            res += x[i];
        }
    }
    t_end = clock();
    seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("sum: %f s\n", seconds);

    /* use the results so the compiler is less likely to optimize the loops away */
    return y1[0], y2[0], res;
}
// results
// cos: 20.910590 s
// multi: 0.633281 s
// sum: 1.153001 s
As you can see, NumPy is incredibly fast, but always a bit slower than pure C. Part of the remaining gap is likely that each NumPy call here allocates a fresh output array, while the C loops write into preallocated buffers.
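If the output allocation is indeed part of that gap, preallocating the result and using the ufunc's out parameter should close some of it; a small variation on the cos loop above (my suggestion, not something the original benchmark measured):

y1 = np.empty_like(x)
for i in xrange(N):
    np.cos(x, out=y1)  # writes into the preallocated buffer, like the C loop does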
I tried to understand your Python code and reproduce it in C++. I found that your for-loops were not arranged correctly for calculating the coefficients, so you should switch your for-loops. If this is the case, you should have the following:
#include <iostream>
#include <cmath>
#include <time.h>

const int k_max = 40000;
const int N = 10000;

int main(int argc, char const *argv[])
{
    time_t start, stop;
    // zero-initialize, since the loop below reads data and accumulates with +=
    double data[2][N] = {};
    double coefs[k_max][2] = {};

    time(&start);
    for(int i=0; i<k_max; ++i)
    {
        // start at j = 1 so that the [j-1] accesses stay in bounds
        for(int j=1; j<N; ++j)
        {
            coefs[i][0] += data[1][j-1] * (cos((i+1) * data[0][j-1]) - cos((i+1) * data[0][j]));
            coefs[i][1] += data[1][j-1] * (sin((i+1) * data[0][j-1]) - sin((i+1) * data[0][j]));
        }
    }
    // End of main loop
    time(&stop);

    // Speed result
    double diff = difftime(stop, start);
    std::cout << "Time: " << diff << " seconds" << std::endl;
    return 0;
}
Switching the for-loops gives me 3 seconds for the C++ code (optimized with -O3), while the Python code runs in 7.816 seconds.
On my computer, your (current) Python code runs in 14.82 seconds (yes, my computer's quite slow).
I rewrote your C++ code to something I'd consider halfway reasonable (basically, I almost ignored your C++ code and just rewrote your Python into C++). That gave me this:
#include <cstdio>
#include <iostream>
#include <cmath>
#include <chrono>
#include <vector>
#include <assert.h>

const unsigned int k_max = 40000;
const unsigned int N = 10000;

template <class T>
class matrix2 {
    std::vector<T> data;
    size_t cols;
    size_t rows;
public:
    // members are initialized in declaration order, so list data first
    matrix2(size_t y, size_t x) : data(x*y), cols(x), rows(y) {}
    T &operator()(size_t y, size_t x) {
        assert(x < cols);
        assert(y < rows);
        return data[y*cols + x];
    }
    T operator()(size_t y, size_t x) const {
        assert(x < cols);
        assert(y < rows);
        return data[y*cols + x];
    }
};

int main() {
    // vector-backed storage is zero-initialized, so += below is well-defined
    matrix2<double> data(N, 2);
    matrix2<double> coeffs(k_max, 2);

    using namespace std::chrono;
    auto start = high_resolution_clock::now();
    for (int k = 0; k < k_max; k++) {
        for (int j = 0; j < N - 1; j++) {
            coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
            coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
        }
    }
    auto end = high_resolution_clock::now();
    std::cout << duration_cast<milliseconds>(end - start).count() << " ms\n";
}
This ran in about 14.4 seconds, so it's a slight improvement over the Python version--but given that the Python is mostly a pretty thin wrapper around some C code, getting only a slight improvement is pretty much what we should expect.
The next obvious step would be to use multiple cores. To do that in C++, we can add this line:
#pragma omp parallel for
...before the outer for loop:
#pragma omp parallel for
for (int k = 0; k < k_max; k++) {
    for (int j = 0; j < N - 1; j++) {
        coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
        coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
    }
}
With -openmp added to the compiler's command line, this ran in about 4.8 seconds. If you have more than 4 cores, you can probably expect a larger improvement than that (conversely, if you have fewer than 4 cores, expect a smaller improvement--but nowadays, more than 4 is a lot more common than fewer).
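For reference, with GCC or Clang the OpenMP flag is spelled -fopenmp, so a full build command would look something like this (file names are placeholders):
g++ -O3 -fopenmp main.cpp -o main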