I'm trying to compare different methods for matrix multiplication. The first one is the normal method:
do
{
for (j = 0; j < i; j++)
{
for (k
ATTENTION: You have a BUG in your second implementation:
for (f = 0; f < i; f++) {
    for (co = 0; co < i; co++) {
        MatrixB[f][co] = MatrixB[co][f];
    }
}
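When f=0 and co=1, the assignment
MatrixB[0][1] = MatrixB[1][0];
overwrites MatrixB[0][1], and its original value is lost! When the loop later reaches f=1, co=0,
MatrixB[1][0] = MatrixB[0][1];
copies back the value that was already in MatrixB[1][0], so nothing is actually swapped. To transpose in place you need to swap each pair through a temporary, visiting only one triangle of the matrix.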
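A minimal sketch of an in-place transpose that avoids this overwrite. I am assuming a square matrix stored as a flat row-major array of double; the name `transpose_in_place` and that layout are my own choices for illustration, not the question's code. The inner loop starts at `f + 1`, so each pair is swapped exactly once:

```c
#include <stddef.h>

/* Transpose an n x n row-major matrix in place.
   Only the upper triangle is visited, and each element is swapped
   with its mirror, so no value is overwritten before it is read. */
void transpose_in_place(double *m, size_t n)
{
    for (size_t f = 0; f < n; f++) {
        for (size_t co = f + 1; co < n; co++) {
            double tmp = m[f * n + co];
            m[f * n + co] = m[co * n + f];
            m[co * n + f] = tmp;
        }
    }
}
```

If you would rather keep the original matrix intact, copying the transpose into a second buffer works just as well and costs one extra allocation.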
If the matrix is not large enough, or you don't repeat the operation many times, you won't see appreciable differences.
If the matrix is, say, 1,000x1,000 you will begin to see improvements, but below 100x100 I would say you should not worry about it.
Also, any 'improvement' may be on the order of milliseconds, unless you are either working with extremely large matrices or repeating the operation thousands of times.
Finally, if you switch to a faster computer, the differences will be even narrower!
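To check whether the difference is measurable on your machine, you can time the multiplication for a few sizes. This is only a rough sketch: `multiply_naive` and `time_multiply` are names I made up, I am assuming flat row-major arrays of double, and `clock()` reports CPU time with a resolution that is too coarse to say much about small matrices.

```c
#include <stdlib.h>
#include <time.h>

/* Plain triple-loop product c = a * b for n x n row-major matrices. */
void multiply_naive(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Return the CPU seconds spent on `reps` multiplications of
   n x n matrices, or -1.0 if allocation fails. */
double time_multiply(size_t n, int reps)
{
    double *a = malloc(n * n * sizeof *a);
    double *b = malloc(n * n * sizeof *b);
    double *c = malloc(n * n * sizeof *c);
    if (!a || !b || !c) { free(a); free(b); free(c); return -1.0; }

    /* Fill the inputs with arbitrary values. */
    for (size_t i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; }

    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        multiply_naive(a, b, c, n);
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    free(a); free(b); free(c);
    return secs;
}
```

Comparing, for example, `time_multiply(100, 1)` against `time_multiply(1000, 1)` for each of your methods should show roughly where the gap starts to matter.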
You should not write matrix multiplication. You should depend on external libraries. In particular you should use the GEMM routine from the BLAS library. GEMM often provides the following optimizations:
Blocking
Efficient matrix multiplication relies on blocking your matrix and performing several smaller blocked multiplies. Ideally the size of each block is chosen to fit nicely into cache, greatly improving performance.
Tuning
The ideal block size depends on the underlying memory hierarchy (how big is the cache?). As a result, libraries should be tuned and compiled for each specific machine. This is done, among others, by the ATLAS implementation of BLAS.
Assembly Level Optimization
Matrix multiplication is so common that developers will optimize it by hand. In particular this is done in GotoBLAS.
Heterogeneous (GPU) Computing
Matrix multiply is very FLOP/compute intensive, making it an ideal candidate to be run on GPUs. cuBLAS and MAGMA are good candidates for this.
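To make the blocking idea above concrete, here is a minimal sketch in plain C. It is not what any BLAS implementation actually does internally; the function name `multiply_blocked`, the flat row-major layout, and the block size parameter `bs` are all my assumptions, and a real library would pick the block size per machine.

```c
#include <stddef.h>

/* Blocked product c = a * b for n x n row-major matrices.
   bs is the block edge length; it does not have to divide n. */
void multiply_blocked(const double *a, const double *b, double *c,
                      size_t n, size_t bs)
{
    if (bs == 0) bs = 1;   /* guard against a zero block size */
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t i_end = ii + bs < n ? ii + bs : n;
                size_t k_end = kk + bs < n ? kk + bs : n;
                size_t j_end = jj + bs < n ? jj + bs : n;
                /* Multiply one block of a by one block of b and
                   accumulate into the matching block of c. */
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

The point of the three outer loops is that the three small blocks being worked on can stay in cache while the inner loops reuse them. Which `bs` works best depends on your cache sizes, so treat any particular value as something to measure.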
In short, dense linear algebra is a well studied topic. People devote their lives to the improvement of these algorithms. You should use their work; it will make them happy.
Just something for you to try (but this would only make a difference for large matrices): separate out your addition logic from the multiplication logic in the inner loop like so:
for (k = 0; k < i; k++)
{
    int sums[i]; //a variable-length array like this is only legal from C99 on
                 //(not in C89 or C++). consider this pseudo-code.
    for (l = 0; l < i; l++)
        sums[l] = MatrixA[j][l] * MatrixB[k][l];
    int suma = 0;
    for (int s = 0; s < i; s++)
        suma += sums[s];
}
This is because you end up stalling your pipeline when you write to suma. Granted, much of this is taken care of by register renaming and the like, but with my limited understanding of hardware, if I wanted to squeeze every ounce of performance out of the code, I would do this, because now you don't have to stall the pipeline to wait for a write to suma. Since multiplication is more expensive than addition, you want to let the machine parallelize it as much as possible, so saving your stalls for the addition means you spend less time waiting in the addition loop than you would in the multiplication loop.
This is just my logic. Others with more knowledge in the area may disagree.
If you are working with small sizes, the improvement you mention is negligible. Also, performance will vary depending on the hardware you are running on. But if you are working with sizes in the millions, it will have an effect. As for the program, can you paste the code you have written?
What Every Programmer Should Know About Memory (PDF) by Ulrich Drepper has a lot of good ideas about memory efficiency, but in particular, he uses matrix multiplication as an example of how knowing about memory and using that knowledge can speed up this process. Look at appendix A.1 in his paper, and read through section 6.2.1. Table 6.2 in the paper shows that he could get his running time down to 10% of a naive implementation's time for a 1000x1000 matrix.
Granted, his final code is pretty hairy and uses a lot of system-specific stuff and compile-time tuning, but still, if you really need speed, reading that paper and reading his implementation is definitely worth it.
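As I recall, the first step in that section is simply to transpose the second matrix into a temporary so that the inner loop reads both operands sequentially. A rough sketch of that idea follows. This is not Drepper's code: `multiply_transposed` and the flat row-major layout are my own assumptions, and none of his blocking or vectorization is included.

```c
#include <stdlib.h>

/* c = a * b for n x n row-major matrices, using a transposed copy of b.
   Returns 0 on success, -1 if the temporary cannot be allocated. */
int multiply_transposed(const double *a, const double *b, double *c, size_t n)
{
    double *bt = malloc(n * n * sizeof *bt);
    if (!bt) return -1;

    /* Copy into a separate buffer, so b itself is never overwritten. */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            bt[j * n + i] = b[i * n + j];

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            /* Row i of a and row j of bt are both walked with stride 1. */
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }

    free(bt);
    return 0;
}
```

The extra copy costs O(n^2) work against O(n^3) for the multiplication itself, which is why it only pays off once the matrices are large enough to fall out of cache.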