I am making some benchmarks with CUDA, C++, C#, Java, and using MATLAB for verification and matrix generation. When I perform matrix multiplication with MATLAB, 2048x
It's slow in C++ because you are not using multithreading. Essentially, if A = B C, where they are all matrices, the first row of A can be computed independently from the 2nd row, etc. If A, B, and C are all n by n matrices, you can speed up the multiplication by a factor of n^2, as
a_{i,j} = sum_{k} b_{i,k} c_{k,j}
If you use, say, Eigen [ http://eigen.tuxfamily.org/dox/GettingStarted.html ], multithreading is built-in and the number of threads is adjustable.