I started a similar question on another thread, but then I was focusing on how to use OpenCV. Having failed to achieve what I originally wanted, I will ask here exactly what I w
One thing that is definitely hurting you in your C++ code is that it has a boatload of char to int conversions. By boatload, I mean up to 2*2782*4000*128 char to int conversions. Those char
to int
conversions are slow, very slow.
You can reduce this to (2782+4000)*128 such conversions by allocating a pair of int
arrays, one 2782*128 and the other 4000*128, to contain the cast-to-integer contents of your char* a
and char* b
arrays. Work with these int*
arrays rather than your char*
arrays.
Another problem might be your use of int
versus long
. I don't work on windows, so this might not be applicable. On the machines I work on, int
is 32 bits and long
is now 64 bits. 32 bits is more than enough because 255*255*128 < 256*256*128 = 223.
That obviously isn't the problem.
What's striking is that the code in question is not calculating that huge 2728 by 4000 array that the Matlab code is creating. What's even more striking is that Matlab is most likely doing this with doubles rather than ints -- and it's still beating the pants off the C/C++ code.
One big problem is cache. That 4000*128 array is far too big for level 1 cache, and you are iterating over that big array 2782 times. Your code is doing far too much waiting on memory. To overcome this problem, work with smaller chunks of the b
array so that your code works with level 1 cache for as long as possible.
Another problem is the optimization if (distance>min_distance) break;
. I suspect that this is actually a dis-optimization. Having if
tests inside your innermost loop is oftentimes a bad idea. Blast through that inner product as fast as possible. Other than wasted computations, there is no harm in getting rid of this test. Sometimes it is better to make apparently unneeded computations if doing so can remove a branch in an innermost loop. This is one of those cases. You might be able to solve your problem just by eliminating this test. Try doing that.
Getting back to the cache problem, you need to get rid of this branch so that you can split the operations over the a
and b
matrix into smaller chunks, chunks of no more than 256 rows at a time. That's how many rows of 128 unsigned chars fit into one of the two modern Intel chip's L1 caches. Since 250 divides 4000, look into logically splitting that b
matrix into 16 chunks. You may well want to form that big 2872 by 4000 array of inner products, but do so in small chunks. You can add that if (distance>min_distance) break;
back in, but do so at a chunk level rather than at the byte by byte level.
You should be able to beat Matlab because it almost certainly is working with doubles, but you can work with unsigned chars and ints.
As you observed, your code is dominated by the matrix product that represents about 2.8e9 arithmetic operations. Yopu say that Matlab (or rather the highly optimized MKL) computes it in about 0.05s. This represents a rate of 57 GFLOPS showing that it is not only using vectorization but also multi-threading. With Eigen, you can enable multi-threading by compiling with OpenMP enabled (-fopenmp
with gcc). On my 5 years old computer (2.66Ghz Core2), using floats and 4 threads, your product takes about 0.053s, and 0.16s without OpenMP, so there must be something wrong with your compilation flags. To summary, to get the best of Eigen:
OMP_NUM_THREADS
environment variable to the number of physical cores (this is very important, otherwise the performance will be very bad!)OMP_NUM_THREADS
to nb_cores-1
Matrix multiply generally uses the worst possible cache access pattern for one of the two matrices, and the solution is to transpose one of the matrices and use a specialized multiply algorithm that works on data stored that way.
Your matrix already IS stored transposed. By transposing it into the normal order and then using a normal matrix multiply, your are absolutely killing performance.
Write your own matrix multiply loop that inverts the order of indices to the second matrix (which has the effect of transposing it, without actually moving anything around and breaking cache behavior). And pass your compiler whatever options it has for enabling auto-vectorization.