I have this C++ function, which I can call from Python with the code below. The performance is only half compared to running pure C++. Is there a way to get their performanc
The problem here is not what happens at run time, but which optimizations are applied at compile time.
Which optimizations are applied depends on the compiler (and even its version), and there is no guarantee that every possible optimization will actually be performed.
Actually, there are two different reasons why Cython is slower, depending on whether you use g++ or clang++: with g++ it is the flag -fwrapv in the Cython build, and with clang++ it is inlining in the test cpp-program.
First issue (g++): Cython compiles with different flags than your pure C++ program, and as a result some optimizations can't be done.
If you look at the log of the setup, you will see:
x86_64-linux-gnu-gcc ... -O2 ..-fwrapv .. -c diff.cpp ... -Ofast -march=native
As you said, -Ofast will win against -O2 because it comes last. The problem, however, is -fwrapv, which seems to prevent some optimizations, because signed integer overflow can no longer be treated as UB and exploited for optimization.
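To illustrate the kind of optimization that depends on signed overflow being UB, here is a minimal sketch. The function next_is_greater is a hypothetical example and is not taken from the question's code.

```cpp
#include <climits>

// Hypothetical illustration, not the code from the question.
// Without -fwrapv, signed overflow is undefined behavior, so the compiler
// may assume that x + 1 never overflows and is free to simplify this
// comparison to a constant "true".
// With -fwrapv, x + 1 is defined to wrap around for x == INT_MAX, so the
// comparison has to be evaluated for real.
bool next_is_greater(int x) {
    return x + 1 > x;
}
```

The same reasoning applies to index arithmetic in loops: when the optimizer may assume that an int index never overflows, it has more freedom to simplify address computations. Whether that is what happens for a particular function depends on the compiler and its version.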
One option is to add -fno-wrapv to extra_compile_args. The disadvantage is that all files are now compiled with the changed flags, which might be unwanted.
Second issue (clang++): inlining in the test cpp-program.
When I compile your cpp-program with my pretty old g++ (version 5.4):
g++ test.cpp -o test -Ofast -march=native -fwrapv
it becomes almost 3 times slower than when compiled without -fwrapv. This is, however, a weakness of the optimizer: when inlining, it should see that no signed integer overflow is possible (all dimensions are about 256), so the flag -fwrapv shouldn't have any impact.
My old clang++ version (3.8) seems to do a better job here: with the flags above I cannot see any performance degradation. I need to disable inlining via -fno-inline to get slower code, and then it is slower even without -fwrapv, i.e.:
clang++ test.cpp -o test -Ofast -march=native -fno-inline
So there is a systematic bias in favor of your C++ program: after inlining, the optimizer can optimize the code for the known values, something Cython cannot do.
So we can see: clang++ was not able to optimize the function diff for arbitrary sizes, but was able to optimize it for size=256. Cython, however, can only use the non-optimized version of diff. That is the reason why -fno-wrapv has no positive impact.
My take-away from this: disallow inlining of the function of interest in the cpp-tester (e.g. compile it in its own object file) to ensure a level playing field with Cython. Otherwise one measures the performance of a program that was specially optimized for this one input.
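As a rough sketch of that take-away, the function under test can be marked so that the benchmark's main cannot inline it. The function sum_abs_diff and the macro BENCH_NOINLINE below are hypothetical stand-ins (the real diff from the question is not reproduced here), and the attribute is a GCC/clang extension.

```cpp
#include <vector>

// Hypothetical macro: expands to the GCC/clang noinline attribute and to
// nothing on other compilers.
#if defined(__GNUC__) || defined(__clang__)
#define BENCH_NOINLINE __attribute__((noinline))
#else
#define BENCH_NOINLINE
#endif

// Hypothetical stand-in for the function under test: sums |a - b| over a
// rows x cols array stored in row-major order.
BENCH_NOINLINE
double sum_abs_diff(const double* a, const double* b, int rows, int cols) {
    double total = 0.0;
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            const double d = a[i * cols + j] - b[i * cols + j];
            total += d < 0 ? -d : d;
        }
    }
    return total;
}
```

Note that noinline only blocks inlining. A compiler may still specialize the function for constant arguments by other means, so putting the function into its own object file (and not using link-time optimization) is probably the more robust way to get a fair comparison.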
NB: A funny thing is that if all ints are replaced by unsigned ints, then naturally -fwrapv doesn't play any role, but the version with unsigned int is as slow as the int version with -fwrapv, which is only logical, as there is no undefined behavior to be exploited.
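A small sketch of why unsigned arithmetic leaves nothing to exploit: wraparound is well defined for unsigned types, so the compiler has to honor it. The function names are hypothetical and only serve as an illustration.

```cpp
#include <climits>

// Hypothetical illustration: unsigned overflow is defined to wrap modulo
// 2^N, so this comparison cannot be folded to a constant. It is false for
// x == UINT_MAX, because UINT_MAX + 1u wraps to 0.
bool next_is_greater_unsigned(unsigned int x) {
    return x + 1u > x;
}

// The wraparound itself, spelled out.
unsigned int max_plus_one() {
    return UINT_MAX + 1u;  // well defined, yields 0
}
```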