I started a similar question on another thread, but then I was focusing on how to use OpenCV. Having failed to achieve what I originally wanted, I will ask here exactly what I w
One thing that is definitely hurting you in your C++ code is that it has a boatload of char to int conversions. By boatload, I mean up to 2*2782*4000*128 char to int conversions. Those char
to int
conversions are slow, very slow.
You can reduce this to (2782+4000)*128 such conversions by allocating a pair of int
arrays, one 2782*128 and the other 4000*128, to contain the cast-to-integer contents of your char* a
and char* b
arrays. Work with these int*
arrays rather than your char*
arrays.
Another problem might be your use of int
versus long
. I don't work on windows, so this might not be applicable. On the machines I work on, int
is 32 bits and long
is now 64 bits. 32 bits is more than enough because 255*255*128 < 256*256*128 = 223.
That obviously isn't the problem.
What's striking is that the code in question is not calculating that huge 2728 by 4000 array that the Matlab code is creating. What's even more striking is that Matlab is most likely doing this with doubles rather than ints -- and it's still beating the pants off the C/C++ code.
One big problem is cache. That 4000*128 array is far too big for level 1 cache, and you are iterating over that big array 2782 times. Your code is doing far too much waiting on memory. To overcome this problem, work with smaller chunks of the b
array so that your code works with level 1 cache for as long as possible.
Another problem is the optimization if (distance>min_distance) break;
. I suspect that this is actually a dis-optimization. Having if
tests inside your innermost loop is oftentimes a bad idea. Blast through that inner product as fast as possible. Other than wasted computations, there is no harm in getting rid of this test. Sometimes it is better to make apparently unneeded computations if doing so can remove a branch in an innermost loop. This is one of those cases. You might be able to solve your problem just by eliminating this test. Try doing that.
Getting back to the cache problem, you need to get rid of this branch so that you can split the operations over the a
and b
matrix into smaller chunks, chunks of no more than 256 rows at a time. That's how many rows of 128 unsigned chars fit into one of the two modern Intel chip's L1 caches. Since 250 divides 4000, look into logically splitting that b
matrix into 16 chunks. You may well want to form that big 2872 by 4000 array of inner products, but do so in small chunks. You can add that if (distance>min_distance) break;
back in, but do so at a chunk level rather than at the byte by byte level.
You should be able to beat Matlab because it almost certainly is working with doubles, but you can work with unsigned chars and ints.