I am writing an OpenMP code that calls different BLAS kernels, mostly DGEMMs of different sizes, from different threads. To maximize performance, I want to have control over the