I'm working on a project with matrix multiplication. I have been able to write the C code and I was able to generate the assembly code for it using Microsoft Visual Studio.
Visual Studio and SSE are red herrings here (as is the C++ vs. C nonsense). Assuming you compile in Release mode, there are other reasons your code is inefficient, especially for large matrices. The main reason is that it's cache unfriendly. To make your code efficient for an arbitrary n*n matrix you need to optimize for both big and small matrices.
It's important to optimize for the cache BEFORE employing SIMD or threads. In the code below I use block multiplication to speed up your code for a 1024x1024 matrix by more than a factor of ten (7.1 s with the old code and 0.6 s with the new) using only a single thread and no SSE/AVX. SIMD won't do any good if your code is memory bound.
I have already described a first-order improvement to matrix multiplication using the transpose here: "OpenMP C++ Matrix Multiplication run slower in parallel".
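For readers who don't follow the link, the transpose idea can be sketched roughly as follows. This is a minimal illustration and not the code from the linked answer; the function names `gemm_naive` and `gemm_transposed` are hypothetical, and row-major square `float` matrices are assumed.

```c
#include <assert.h>
#include <stdlib.h>

/* Naive triple loop: the inner loop walks b down a column (stride n),
   so nearly every access to b touches a different cache line. */
void gemm_naive(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}

/* Hypothetical helper: transpose b into a scratch buffer first, so the
   inner loop reads both operands with unit stride.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int gemm_transposed(const float *a, const float *b, float *c, int n) {
    assert(a && b && c && n > 0);
    float *bt = malloc(sizeof *bt * (size_t)n * (size_t)n);
    if (!bt) return -1;

    /* bt[j][i] = b[i][j] */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            bt[j*n + i] = b[i*n + j];

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * bt[j*n + k];  /* both rows contiguous */
            c[i*n + j] = sum;
        }

    free(bt);
    return 0;
}
```

The transpose costs O(n^2) work against O(n^3) for the multiply, so it pays for itself quickly as n grows.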
But let me describe an even more cache-friendly method. Let's assume your hardware has two types of memory: one small and fast, the other large and slow.
In reality, modern CPUs have several levels of this (L1 small and fast, L2 larger and slower, L3 even larger and slower, main memory larger and slower still; some CPUs even have an L4), but this simple two-level model will still lead to a big improvement in performance.
Using this model with two types of memory you can show that you will get the best performance by dividing your matrix into square tiles which fit in the small and fast memory and doing block matrix multiplication. Next you want to rearrange the memory so that the elements of each tile are contiguous.
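As a rough illustration of those two steps (tiling, then making each tile contiguous), here is a small self-contained sketch. It is not the listing from this answer: the names `tile_pack`, `tile_unpack`, `tile_madd` and `gemm_tiled` are hypothetical, and it assumes row-major square `float` matrices whose size `n` is a multiple of the block size `bs`.

```c
#include <assert.h>
#include <string.h>

/* Copy a row-major n x n matrix into tiled order: tiles are laid out in
   row-major tile order, and the bs*bs elements of each tile are contiguous. */
void tile_pack(const float *src, float *dst, int n, int bs) {
    int nb = n / bs, cnt = 0;
    for (int ti = 0; ti < nb; ti++)
        for (int tj = 0; tj < nb; tj++)
            for (int i = 0; i < bs; i++)
                for (int j = 0; j < bs; j++)
                    dst[cnt++] = src[(ti*bs + i)*n + (tj*bs + j)];
}

/* Inverse of tile_pack: tiled order back to plain row-major. */
void tile_unpack(const float *src, float *dst, int n, int bs) {
    int nb = n / bs, cnt = 0;
    for (int ti = 0; ti < nb; ti++)
        for (int tj = 0; tj < nb; tj++)
            for (int i = 0; i < bs; i++)
                for (int j = 0; j < bs; j++)
                    dst[(ti*bs + i)*n + (tj*bs + j)] = src[cnt++];
}

/* c += a * b for three contiguous bs x bs tiles. With a suitable bs all
   three tiles stay in the small, fast memory for the whole call. */
static void tile_madd(const float *a, const float *b, float *c, int bs) {
    for (int i = 0; i < bs; i++)
        for (int k = 0; k < bs; k++) {
            float aik = a[i*bs + k];
            for (int j = 0; j < bs; j++)
                c[i*bs + j] += aik * b[k*bs + j];
        }
}

/* Block matrix multiply; a, b and c are all in tiled order. */
void gemm_tiled(const float *a, const float *b, float *c, int n, int bs) {
    assert(bs > 0 && n % bs == 0);
    int nb = n / bs;
    size_t tsz = (size_t)bs * (size_t)bs;   /* elements per tile */
    memset(c, 0, sizeof *c * (size_t)n * (size_t)n);
    for (int ti = 0; ti < nb; ti++)
        for (int tj = 0; tj < nb; tj++)
            for (int tk = 0; tk < nb; tk++)
                tile_madd(a + (size_t)(ti*nb + tk) * tsz,
                          b + (size_t)(tk*nb + tj) * tsz,
                          c + (size_t)(ti*nb + tj) * tsz, bs);
}
```

A caller would pack both inputs once with `tile_pack`, call `gemm_tiled`, and unpack the result with `tile_unpack`. The packing is O(n^2), so its cost is small next to the O(n^3) multiply.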
Below is some code showing how to do this. I used a block size of 64x64 on a 1024x1024 matrix. It took 7 s with your code and 0.65 s with mine. The matrix dimension has to be a multiple of 64, but it's easy to extend this to an arbitrary matrix size. If you want to see an example of how to optimize the blocks, see "Difference in performance between MSVC and GCC for highly optimized matrix multplication code".
#include
#include
#include
#include
void reorder(float *a, float *b, int n, int bs) {
int nb = n/bs;
int cnt = 0;
for(int i=0; i