I believe I am experiencing false sharing using OpenMP. Is there any way to identify it and fix it?
My code is: https://github.com/wchan/libNN/blob/master/ResilientBackpropagation.hpp line 36.
On a 4-core CPU, the parallel version yields only about 10% more performance than the single-threaded version. On a NUMA system with 32 physical (64 virtual) cores, CPU utilization is stuck at around 1.5 cores. I think this is a direct symptom of false sharing and the reason the code fails to scale.
I also ran it under the Intel VTune profiler, which reported that most of the time is spent in the "f()" and "+=" functions. That seems reasonable, but it doesn't really explain why I am getting such poor scaling...
Any ideas/suggestions?
Thanks.
Use reduction instead of explicitly indexing an array based on the thread ID. That array virtually guarantees false sharing.
i.e. replace this
#pragma omp parallel for
// ... inside the parallel loop, each thread accumulates into its own clone:
clones[omp_get_thread_num()]->mse() += norm_2(dedy);
// ... after the loop, the per-thread results are summed:
for (int i = 0; i < omp_get_max_threads(); i++) {
    neural_network->mse() += clones[i]->mse();
}
with this:
double mse = 0.0;
#pragma omp parallel for reduction(+ : mse)
// ... inside the loop, each thread accumulates into its private copy of mse:
mse += norm_2(dedy);
// ... after the loop, store the combined result once:
neural_network->mse() = mse;
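For reference, here is a minimal self-contained sketch of that reduction pattern. The data and names (errors, the squared-error loop) are illustrative stand-ins, not the actual libNN code:

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    // Illustrative data: pretend these are per-sample error norms.
    std::vector<double> errors(1000000, 0.5);
    double mse = 0.0;

    // Each thread gets a private copy of mse, and OpenMP combines the
    // copies once at the end of the loop, so no shared cache line is
    // written concurrently by multiple threads.
    #pragma omp parallel for reduction(+ : mse)
    for (long i = 0; i < (long)errors.size(); i++) {
        mse += errors[i] * errors[i];
    }

    std::printf("mse = %f\n", mse / errors.size());
    return 0;
}

// Compile with: g++ -O2 -fopenmp reduction_demo.cpp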
One way of knowing for sure is to look at cache statistics with a tool like cachegrind:
valgrind --tool=cachegrind [command]
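If you want a small standalone reproducer to experiment with (an illustrative sketch, not the libNN code): per-thread accumulators packed into a contiguous array sit on the same cache line and thrash, while padding each slot to a full cache line with alignas removes the effect. Padding is an alternative to reduction for cases where you really do need per-thread slots; comparing the timings (or cache statistics) of the two loops below makes the difference visible.

#include <omp.h>
#include <cstdio>
#include <vector>

// Each padded slot occupies its own 64-byte cache line, so threads writing
// to "their" slot no longer invalidate each other's lines.
struct alignas(64) PaddedDouble { double value = 0.0; };

int main() {
    const long iters = 50000000;
    const int nthreads = omp_get_max_threads();

    std::vector<double> packed(nthreads, 0.0);   // adjacent doubles share a cache line
    std::vector<PaddedDouble> padded(nthreads);  // one cache line per thread

    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        // volatile forces a store on every iteration so the compiler
        // cannot keep the accumulator in a register.
        volatile double* slot = &packed[omp_get_thread_num()];
        for (long i = 0; i < iters; i++) *slot += 1.0;
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        volatile double* slot = &padded[omp_get_thread_num()].value;
        for (long i = 0; i < iters; i++) *slot += 1.0;
    }
    double t2 = omp_get_wtime();

    std::printf("packed: %.3f s, padded: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}

// Compile with: g++ -O2 -std=c++17 -fopenmp false_sharing_demo.cpp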
Source: https://stackoverflow.com/questions/9027653/openmp-false-sharing