Question
I am using MKL's cblas_dgemm and currently call it with CblasRowMajor, CblasNoTrans, CblasNoTrans for my matrices. I know that C is a row-major language, whereas dgemm is a column-major algorithm. I am interested to know whether switching the ordering of the matrices will have any effect on cblas_dgemm when linking against MKL. Is MKL smart enough to do, behind the scenes, what I would try to do myself to optimize the matrix multiplications? If not, what is the best way to perform matrix multiplications with MKL?
Answer 1:
TL;DR: It does not matter whether you perform matrix-matrix multiplications using row-major or column-major ordering with MKL (or other BLAS implementations).
I know that C is a row major language, whereas dgemm is a column major algorithm.
DGEMM is not a column-major algorithm; it is the BLAS interface for computing a matrix-matrix product with general matrices. The common reference implementation for DGEMM (and most of BLAS) is Netlib's, which is written in Fortran. The only reason it assumes column-major ordering is that Fortran is a column-major language. DGEMM (and the corresponding BLAS Level 3 functions) is not specific to column-major data.
What Does DGEMM Compute?
Mathematically, DGEMM performs a 2D matrix-matrix multiplication. The standard algorithm requires you to traverse one matrix along its rows and the other along its columns: to compute AB = C, we multiply the rows of A by the columns of B to produce C. The storage ordering of the inputs therefore does not matter, as one matrix must always be traversed along its rows and the other along its columns.
Investigating Row-Major and Column-Major DGEMM Computation with MKL
Intel MKL is smart enough to utilise this under the hood and provides the exact same performance for row-major and column-major data.
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, ...);
and
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, ...);
will execute with similar performance. We can test this with a relatively simple program
#include <float.h>
#include <mkl.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h> /* atoi, malloc, free */
void init_matrix(double *A, int n, int m, double d);
void test_dgemm(CBLAS_LAYOUT Layout, double *A, double *B, double *C, const MKL_INT m, const MKL_INT n, const MKL_INT k, int nSamples, double *timing);
void print_summary(const MKL_INT m, const MKL_INT n, const MKL_INT k, const int nSamples, const double *timing);
int main(int argc, char **argv) {
    MKL_INT n, k, m;
    double *a, *b, *c;
    double *timing;
    int nSamples = 1;
    if (argc != 5) {
        fprintf(stderr, "Error: Wrong number of arguments!\n");
        fprintf(stderr, "usage: %s mMatrix nMatrix kMatrix NSamples\n", argv[0]);
        return -1;
    }
    m = atoi(argv[1]);
    n = atoi(argv[2]);
    k = atoi(argv[3]);
    nSamples = atoi(argv[4]);
    timing = malloc(nSamples * sizeof *timing);
    a = mkl_malloc(m*k * sizeof *a, 64);
    b = mkl_malloc(k*n * sizeof *b, 64);
    c = mkl_calloc(m*n, sizeof *c, 64);
    /** ROW-MAJOR ORDERING **/
    test_dgemm(CblasRowMajor, a, b, c, m, n, k, nSamples, timing);
    /** COLUMN-MAJOR ORDERING **/
    test_dgemm(CblasColMajor, a, b, c, m, n, k, nSamples, timing);
    mkl_free(a);
    mkl_free(b);
    mkl_free(c);
    free(timing);
}
void init_matrix(double *A, int n, int m, double d) {
    int i, j;
    #pragma omp parallel for schedule(static) private(j)
    for (i = 0; i < n; ++i) {
        for (j = 0; j < m; ++j) {
            A[j + i*m] = d * (double) (i - j) / n;
        }
    }
}
void test_dgemm(CBLAS_LAYOUT Layout, double *A, double *B, double *C, const MKL_INT m, const MKL_INT n, const MKL_INT k, int nSamples, double *timing) {
    int i;
    /* The leading dimensions depend on the layout: for row-major they are
       the row lengths (k, n, n); for column-major the column lengths (m, k, m).
       They only coincide for square matrices. */
    MKL_INT lda = (Layout == CblasRowMajor) ? k : m;
    MKL_INT ldb = (Layout == CblasRowMajor) ? n : k;
    MKL_INT ldc = (Layout == CblasRowMajor) ? n : m;
    double alpha = 1.0, beta = 0.0;
    if (CblasRowMajor == Layout) {
        printf("\n*****ROW-MAJOR ORDERING*****\n\n");
    } else if (CblasColMajor == Layout) {
        printf("\n*****COLUMN-MAJOR ORDERING*****\n\n");
    }
    init_matrix(A, m, k, 0.5);
    init_matrix(B, k, n, 0.75);
    init_matrix(C, m, n, 0);
    // First call performs any buffer/thread initialisation
    cblas_dgemm(Layout, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
    double tmin = DBL_MAX, tmax = 0.0;
    for (i = 0; i < nSamples; ++i) {
        init_matrix(A, m, k, 0.5);
        init_matrix(B, k, n, 0.75);
        init_matrix(C, m, n, 0);
        timing[i] = dsecnd();
        cblas_dgemm(Layout, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
        timing[i] = dsecnd() - timing[i];
        if (tmin > timing[i]) tmin = timing[i];
        else if (tmax < timing[i]) tmax = timing[i];
    }
    print_summary(m, n, k, nSamples, timing);
}
void print_summary(const MKL_INT m, const MKL_INT n, const MKL_INT k, const int nSamples, const double *timing) {
    int i;
    double tavg = 0.0;
    for (i = 0; i < nSamples; i++) {
        tavg += timing[i];
    }
    tavg /= nSamples;
    printf("#Loop | Sizes m n k | Time (s)\n");
    for (i = 0; i < nSamples; i++) {
        printf("%4d %12d %3d %3d %6.4f\n", i + 1, m, n, k, timing[i]);
    }
    printf("Summary:\n");
    printf("Sizes m n k | Avg. Time (s)\n");
    printf(" %8d %3d %3d %12.8f\n", m, n, k, tavg);
}
On my system this produces:
$ ./benchmark_dgemm 1000 1000 1000 5
*****ROW-MAJOR ORDERING*****
#Loop | Sizes m n k | Time (s)
1 1000 1000 1000 0.0589
2 1000 1000 1000 0.0596
3 1000 1000 1000 0.0603
4 1000 1000 1000 0.0626
5 1000 1000 1000 0.0584
Summary:
Sizes m n k | Avg. Time (s)
1000 1000 1000 0.05995692
*****COLUMN-MAJOR ORDERING*****
#Loop | Sizes m n k | Time (s)
1 1000 1000 1000 0.0597
2 1000 1000 1000 0.0610
3 1000 1000 1000 0.0581
4 1000 1000 1000 0.0594
5 1000 1000 1000 0.0596
Summary:
Sizes m n k | Avg. Time (s)
1000 1000 1000 0.05955171
where we can see that there is very little difference between the column-major ordering time and the row-major ordering time. 0.0595 seconds for column-major versus 0.0599 seconds for row-major. Executing this again might produce the following, where the row-major calculation is faster by 0.00003 seconds.
$ ./benchmark_dgemm 1000 1000 1000 5
*****ROW-MAJOR ORDERING*****
#Loop | Sizes m n k | Time (s)
1 1000 1000 1000 0.0674
2 1000 1000 1000 0.0598
3 1000 1000 1000 0.0595
4 1000 1000 1000 0.0587
5 1000 1000 1000 0.0584
Summary:
Sizes m n k | Avg. Time (s)
1000 1000 1000 0.06075310
*****COLUMN-MAJOR ORDERING*****
#Loop | Sizes m n k | Time (s)
1 1000 1000 1000 0.0634
2 1000 1000 1000 0.0596
3 1000 1000 1000 0.0582
4 1000 1000 1000 0.0582
5 1000 1000 1000 0.0645
Summary:
Sizes m n k | Avg. Time (s)
1000 1000 1000 0.06078266
Source: https://stackoverflow.com/questions/32445741/does-mkl-optimize-cblas-for-major-order