I'm only asking this to try to understand what I've spent 24 hours trying to fix.
My system: Ubuntu 12.04.2 and MATLAB R2011a, both 64-bit, on a Nehalem-based Intel Xeon processor.
The problem is simple: MATLAB allows OpenMP-based programs to use all CPU cores with hyper-threading enabled, but does not allow the same for TBB.
When running TBB, I can launch only 4 threads, even when I change maxNumCompThreads to 8, whereas with OpenMP I can use all the threads I want. Without hyper-threading, both TBB and OpenMP use all 4 cores, of course.
I understand hyper-threading and that the extra cores are virtual, but the limit MATLAB imposes does carry a performance penalty (an extra reference).
I tested this issue using 2 programs, a simple for loop with
#pragma omp parallel for
and another very simple loop based on TBB sample code:
tbb::task_scheduler_init init(tbb::task_scheduler_init::deferred);
tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());
I wrapped both of them in a MATLAB mexFunction.
Does anyone have an explanation for this? Is there an inherent difference in the thread creation method or structure that allows MATLAB to throttle TBB but not OpenMP?
Code for reference:
OpenMP:
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    const int threadCount = 100000; // outer iterations shared among the OpenMP threads
    #pragma omp parallel for
    for (int globalId = 0; globalId < threadCount; globalId++)
    {
        for (long i = 0; i < 1000000000L; ++i) {} // Deliberately run slow
    }
}
TBB:
#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>
#include <vector>
#include "mex.h"
struct mytask {
    mytask(size_t n)
        : _n(n)
    {}
    void operator()() {
        for (long i = 0; i < 1000000000L; ++i) {} // Deliberately run slow
        std::cerr << "[" << _n << "]";
    }
    size_t _n;
};
template <typename T> struct invoker {
    void operator()(T& it) const { it(); }
};
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
    tbb::task_scheduler_init init(tbb::task_scheduler_init::deferred); // Automatic number of threads
    std::vector<mytask> tasks;
    for (int i = 0; i < 10000; ++i)
        tasks.push_back(mytask(i));
    tbb::parallel_for_each(tasks.begin(), tasks.end(), invoker<mytask>());
}
Sorry it took so long to answer. Specifying deferred just keeps the task scheduler from creating the thread pool until the first parallel construct starts. By default, the number of threads is automatic, which corresponds to the number of cores and also depends on CPU affinity, among other things (the code setting this is in src/tbb/tbb_misc_ex.cpp; see initialize_hardware_concurrency_info()).
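To make the affinity point concrete, here is a minimal sketch, assuming Linux with glibc, that compares the number of logical CPUs the machine reports with the number the current process is allowed to run on. It is not TBB's code; the helper names logical_cpus and allowed_cpus are made up for illustration. If the host process narrows the affinity mask, any thread count derived from that mask shrinks with it. Whether that is what happens under MATLAB is not established here, but calling something like this from inside the MEX function would be one way to check.

```cpp
#include <sched.h>   // sched_getaffinity, cpu_set_t, CPU_ZERO, CPU_COUNT (Linux/glibc)
#include <thread>    // std::thread::hardware_concurrency

// Hypothetical helper: logical CPUs reported by the standard library.
// Hyper-threads count as CPUs; the value may be 0 if it cannot be determined.
unsigned logical_cpus() {
    return std::thread::hardware_concurrency();
}

// Hypothetical helper: logical CPUs in the calling process's affinity mask.
// Returns -1 if the mask cannot be queried.
int allowed_cpus() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    return CPU_COUNT(&mask);
}
```

Printing both values from the MEX function and from a standalone executable on the same machine would show whether the process running inside MATLAB sees a smaller mask.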
I modified your code slightly:
#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/atomic.h"
#include "tbb/spin_mutex.h"
#include <iostream>
#include <vector>
// If LOW_THREAD == 0, run with task_scheduler_init(automatic), which is the number
// of cores available. If 1, start with 1 thread.
#ifndef NTASKS
#define NTASKS 50
#endif
#ifndef MAXWORK
#define MAXWORK 400000000L
#endif
#ifndef LOW_THREAD
#define LOW_THREAD 0 // 0 == automatic
#endif
tbb::atomic<size_t> cur_par;
tbb::atomic<size_t> max_par;
#if PRINT_OUTPUT
tbb::spin_mutex print_mutex;
#endif
struct mytask {
    mytask(size_t n) : _n(n) {}
    void operator()() {
        size_t my_par = ++cur_par; // number of tasks running right now, including this one
        size_t my_old = max_par;
        // Raise max_par to my_par unless another task has already recorded a higher value.
        while (my_old < my_par) { my_old = max_par.compare_and_swap(my_par, my_old); }
        for (long i = 0; i < MAXWORK; ++i) {} // Deliberately run slow
#if PRINT_OUTPUT
        {
            tbb::spin_mutex::scoped_lock s(print_mutex);
            std::cerr << "[" << _n << "]";
        }
#endif
        --cur_par;
    }
    size_t _n;
};
template <typename T> struct invoker {
void operator()(T& it) const {it();}
};
void mexFunction(/*int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]*/) {
    for (size_t thr = LOW_THREAD; thr <= 128; thr = thr ? thr * 2 : 1) {
        cur_par = max_par = 0;
        tbb::task_scheduler_init init(thr == 0 ? (unsigned int)tbb::task_scheduler_init::automatic : thr);
        std::vector<mytask> tasks;
        for (int i = 0; i < NTASKS; ++i) tasks.push_back(mytask(i));
        tbb::parallel_for_each(tasks.begin(), tasks.end(), invoker<mytask>());
        std::cout << " for thr == ";
        if (thr) std::cout << thr; else std::cout << "automatic";
        std::cout << ", maximum parallelism == " << (size_t)max_par << std::endl;
    }
}
int main() {
    mexFunction();
}
I ran this on a 16-core system here:
 for thr == automatic, maximum parallelism == 16
 for thr == 1, maximum parallelism == 1
 for thr == 2, maximum parallelism == 2
 for thr == 4, maximum parallelism == 4
 for thr == 8, maximum parallelism == 8
 for thr == 16, maximum parallelism == 16
 for thr == 32, maximum parallelism == 32
 for thr == 64, maximum parallelism == 50
 for thr == 128, maximum parallelism == 50
The limit of 50 is the total number of tasks created by the program.
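The measurement itself does not depend on TBB. As a rough sketch, here is the same high-water-mark idea written with std::atomic and std::thread instead of tbb::atomic and a TBB parallel construct. The names ParallelismGauge and measure_peak are made up for illustration, and the spin-wait is only there to force every thread to overlap so the result is predictable.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Tracks how many tasks are inside a region and the highest count ever seen.
class ParallelismGauge {
public:
    void enter() {
        std::size_t now = ++current_;
        std::size_t seen = peak_.load();
        // Raise peak_ to `now` unless another thread already stored a larger value.
        // On failure compare_exchange_weak reloads `seen`, so the test is re-evaluated.
        while (seen < now && !peak_.compare_exchange_weak(seen, now)) {}
    }
    void leave() { --current_; }
    std::size_t peak() const { return peak_.load(); }

private:
    std::atomic<std::size_t> current_{0};
    std::atomic<std::size_t> peak_{0};
};

// Starts `nthreads` threads that all stay inside the region until every
// thread has entered, then returns the peak the gauge recorded.
std::size_t measure_peak(std::size_t nthreads) {
    ParallelismGauge gauge;
    std::atomic<std::size_t> arrived{0};
    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < nthreads; ++i) {
        pool.emplace_back([&gauge, &arrived, nthreads] {
            gauge.enter();
            ++arrived;
            while (arrived.load() < nthreads) std::this_thread::yield();
            gauge.leave();
        });
    }
    for (std::thread& t : pool) t.join();
    return gauge.peak();
}
```

With real work in place of the spin-wait, the peak is bounded by both the number of worker threads and the number of tasks, which is why the runs above stop at 50.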
The threads created by TBB are shared by all the parallel constructs started by the program, so if you have two parallel_for_each calls running simultaneously, the maximum number of threads will not change; each parallel_for_each will simply run more slowly. The TBB library does not control the number of threads used in OpenMP constructs, so an OpenMP parallel for and a TBB parallel_for_each running at the same time will generally oversubscribe the machine.
Source: https://stackoverflow.com/questions/17328127/matlab-limits-tbb-but-not-openmp