Question
I have a routine that performs a few MKL calls on small matrices (50-100 x 1000 elements) to fit a model, which I then call for different models. In pseudo-code:
double doModelFit(int model, ...) {
    ...
    while (!done) {
        cblas_dgemm(...);
        cblas_dgemm(...);
        ...
        dgesv(...);
        ...
    }
    return result;
}
int main(int argc, char **argv) {
    ...
    c_start = 1; c_stop = nmodel;
    for (int c = c_start; c <= c_stop; c++) {  // models are numbered 1..nmodel
        ...
        result = doModelFit(c, ...);
        ...
    }
}
Call the above version 1. Since the models are independent, I can use OpenMP threads to parallelize the model fitting, as follows (version 2):
int main(int argc, char **argv) {
    ...
    int numthreads = omp_get_max_threads();
    #pragma omp parallel for
    for (int t = 0; t < numthreads; t++) {
        // assuming nmodel is divisible by numthreads...
        // c_start/c_end are declared inside the region, so each
        // thread gets its own private copy
        int c_start = t * nmodel / numthreads + 1;
        int c_end   = (t + 1) * nmodel / numthreads;
        for (int c = c_start; c <= c_end; c++) {
            ...
            result = doModelFit(c, ...);
            ...
        }
    }
}
When I run version 1 on the host machine, it takes ~11 seconds and VTune reports poor parallelization with most of the time spent idle. Version 2 on the host machine takes ~5 seconds and VTune reports great parallelization (near 100% of the time is spent with 8 CPUs in use). Now, when I compile the code to run on the Phi card in native mode (with -mmic), versions 1 and 2 both take approximately 30 seconds when run on the command prompt on mic0. When I use VTune to profile it:
- Version 1 still takes roughly 30 seconds, and the hotspot analysis shows that most of the time is spent in __kmp_wait_sleep and __kmp_static_yield: of 7710 s of CPU time, 5804 s is spin time.
- Version 2 takes far longer; I killed it after it had run for a couple of minutes in VTune. The hotspot analysis shows that of 25254 s of CPU time, 21585 s is spent in [vmlinux].
Can anyone shed some light on what's going on here and why I'm getting such bad performance? I'm using the default for OMP_NUM_THREADS and set KMP_AFFINITY=compact,granularity=fine (as recommended by Intel). I'm new to MKL and OpenMP, so I'm certain I'm making rookie mistakes.
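Concretely, the only threading-related setting in my environment is the following (shell syntax shown for illustration):

    export KMP_AFFINITY=compact,granularity=fine
    # OMP_NUM_THREADS is left unset, i.e. at its default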
Thanks, Andrew
Answer 1:
Given that most of the time is spent in the OS (vmlinux), the most probable cause of this behavior is over-subscription from nested OpenMP parallel regions inside MKL's implementations of cblas_dgemm() and dgesv(). See this example for an illustration.
This diagnosis is supported and explained by Jim Dempsey on the Intel forum.
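A minimal sketch of the usual fix (the matrix sizes and names below are invented for illustration): keep the parallelism at the model level and force MKL to run each call on a single thread, either via MKL_NUM_THREADS=1 in the environment or mkl_set_num_threads(1) in code:

    #include <stdio.h>
    #include <mkl.h>

    #define N      64   /* stand-in size for one small model matrix */
    #define NMODEL 32   /* stand-in number of independent fits */

    int main(void) {
        /* Disable MKL's internal threading: the outer OpenMP loop already
           uses all cores, so nested MKL teams would only oversubscribe.
           (mkl_set_num_threads_local(1) inside the region also works.) */
        mkl_set_num_threads(1);

        static double a[N * N], b[N * N];
        for (int i = 0; i < N * N; i++) { a[i] = 1.0; b[i] = 2.0; }

        double check = 0.0;
        #pragma omp parallel for reduction(+ : check)
        for (int m = 0; m < NMODEL; m++) {
            double c[N * N];  /* private per-thread result buffer */
            /* each iteration stands in for one doModelFit() call */
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        N, N, N, 1.0, a, N, b, N, 0.0, c, N);
            check += c[0];
        }
        printf("check: %f\n", check);
        return 0;
    }

With this arrangement there is exactly one MKL call running per OpenMP thread, so no thread team is ever spawned inside another.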
Answer 2:
What about using the sequential MKL library? If you link against MKL with the sequential option, MKL does not spawn OpenMP threads inside itself. I suspect you would get better results than you do now.
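The link line would change to something like this (a sketch; exact library names vary by MKL version, and Intel's link-line advisor gives the authoritative set):

    # convenience flag on icc:
    icc -openmp -mkl=sequential myprog.c -o myprog

    # or link the layered libraries explicitly:
    icc -openmp myprog.c -o myprog \
        -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm

    # add -mmic for a native Phi build

With the sequential layer, the only OpenMP threads are the ones your own #pragma omp creates, so version 2 runs one single-threaded MKL call per core.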
Source: https://stackoverflow.com/questions/19734145/mkl-performance-on-intel-phi